Many of us eagerly look forward each year to receiving the
October supplement to Academic Medicine (now available 
online at http://www.academicmedicine.org). We know
that it will contain research reports and reviews that expand our 
knowledge and inspire us to encourage others to undertake the study 
of important questions as well as to continue our own research pro- 
grams. The proceedings of the 39th annual conference on Research 
in Medical Education (RIME) is no exception. The theme of this
year’s meeting, Making a Difference, is truly the goal of medical edu- 
cation research, and many facets of this goal are reflected in the 
papers that constitute the program for this meeting. 

The purpose of the RIME conference is to provide a forum for 
the presentation and discussion of research concerning all aspects 
of medical education. The annual meeting of the Association of 
American Medical Colleges represents the largest gathering of the 
academic medicine community in the world. The meeting provides
an opportunity for medical education researchers to demonstrate the 
vigor and diversity of scholarly investigations in medical education. 

I am exceptionally pleased that Dr. Whitney Addington, director 
of the Rush Primary Care Institute and professor of Family Medicine, 
Internal Medicine, and Nursing, will give the 2000 RIME confer- 
ence’s invited address, entitled “Lifelong Learning: When Are We 
Going to Practice What We Preach?” Dr. Addington has promised 
to be provocative in his advocacy for continuous professional de- 
velopment. As the outgoing president of the American College of 
Physicians, he is in a unique position to reflect on this issue and chal- 
lenge us to increase our inquiry on how best to accomplish the goals 
of lifelong learning. Dr. Addington is president of the Chicago Board 
of Health and has been a strong campaigner for universal health 
insurance in this country. I am delighted he has agreed to address us
and I look forward to his comments. 

An increased number of research papers were submitted to the 
RIME committee this year, another indication of the health of med- 
ical education research. Of the 98 research papers submitted, 36 were 
selected for presentation and inclusion in the proceedings. The top- 
ics range from the tried and true, such as standardized patient and 
OSCE examinations and related measurement issues, to some that 
have had limited discussion, such as telemedicine and Web-based 
education, the role of hospitalists, and the impact of the marketplace 
on academic medicine. Each paper is scheduled, along with one or
two other papers on a similar topic, to a session led by a moderator, 
who has the opportunity to raise pertinent questions and ensure a
wide-ranging discussion by the audience. In an effort to highlight 
topics that may be of special interest, six of the research papers were 
selected for presentation in two plenary sessions. This year, some of 
our outstanding researchers will be joined by a number of new faces. 
Especially gratifying is that several of these new faces are those of res- 
ident physicians and medical students, a further indication of the 
vitality of medical education research. 

In addition to the research papers, there were 220 abstract propos- 
als received, 96 of which were selected for presentation as either 
posters or oral presentations. As in the past, the RIME conference 
will host a reception to showcase the abstracts in the poster format 
on Monday evening. The oral abstracts will be presented in sessions 
organized around specific topics, each guided by a moderator. 



This supplement to Academic Medicine includes, in addition to the 
research papers, two review papers and the invited address from 1999. 
Dr. Zubair Amin will review issues related to a common occurrence 
in residency education, the morning report, and point out areas that
need to be the focus of research. The other review, presented by Dr. 
Shiphra Ginsburg, will address the challenge of evaluating profes- 
sionalism and explore the relationships among context, conflict, and 
resolution in this endeavor. In 1999, Dr. Charles Friedman, from the 
University of Pittsburgh, presented a thought-provoking address on 
the role of informatics and information technology in medical educa- 
tion. His informative presentation, entitled “The Marvelous Medical 
Education Machine,” is published in these proceedings. 

This year, for the first time, the supplement also includes the 1999 
Jack Maatsch Memorial Presentation, sponsored by the Office of 
Medical Education Research and Development at Michigan State 
University. “The Epistemology of Clinical Reasoning: Perspectives 
from Philosophy, Psychology, and Neuroscience,” by Dr. Geoffrey 
R. Norman of McMaster University, draws on differing disciplines
to expand the subject of clinical reasoning and its avenues for fur- 
ther research. It is accompanied by “Clinical Problem Solving and 
Decision Psychology,” by Dr. Arthur Elstein of the University of 
Illinois at Chicago. 

Two excellent symposia were selected for presentation, one, mod- 
erated by Dr. Deborah Simpson, on the challenging task of measur- 
ing faculty development outcomes, and the other, moderated by 
Dr. David Newble, on assessing the performances of practicing
physicians. The always-popular “RIME wrap-up” session will round 
out the formal presentations. We are fortunate to have persuaded 
Dr. John Bligh, of The University of Liverpool, and editor of Medical 
Education (UK); Dr. Georges Bordage, University of Illinois at 
Chicago; and Dr. Judy Shea, University of Pennsylvania, to help us 
put the meeting's presentations in perspective and suggest where we 
might go from here.

The RIME conference is planned and organized by the RIME com- 
mittee, a committee of the AAMC’s Group on Educational Affairs. 
On behalf of the committee, we wish to express our appreciation to 
all researchers who submitted papers for the meeting. It was not an 
easy task to select those to place on the program, and we are indebted 
to the essential contribution made by the external reviewers. These 
individuals provided suggestions and comments that have benefited
the authors of the papers, symposia, and abstracts that make up this 
year’s meeting. 

The RIME committee wants to recognize the outstanding contri- 
bution of Brownie (M. Brownell) Anderson, Associate Vice- 
President, Division of Medical Education of the AAMC. I know the 
committee members join me in saying that the process of requesting 
and reviewing papers and developing the program was made infinitely 
easier by her guidance and wisdom, over and above her welcome 
sense of humor. Her staff, as well, eased our task. In addition, we 
appreciate the support of Addeane Caelleigh, editor of Academic 
Medicine, and her able staff for their assistance in publishing these 
proceedings. We hope you enjoy the meeting. 

Beth Dawson 

Chair, 2000 RIME Committee
Morning Report: Focus and Methods over the Past Three Decades

ZUBAIR AMIN, JESUS GUAJARDO, WLODZIMIERZ WISNIEWSKI, GEORGES BORDAGE,
ARA TEKIAN, and LEO G. NIEDERMAN



Residents rank morning report as the most important educational 
activity of their residency training. Although there is a lack of
documented evidence as to the educational value of morning report,
the practice is ubiquitous across almost all primary care residency
programs in North America. The ever-changing practice of
medicine and ongoing demands for evidence in medical education 
force us to examine essential aspects of morning report in order to 
base future decisions about morning report on sound educational 
evidence. Thus, a systematic review of the published literature on
morning report was done in order to identify the various purposes 
and modalities of morning report, to find evidence in support of its 
educational value, and to discuss possible future directions for
research on morning report.

The term “morning report” is used to describe case-based con- 
ferences where residents, attending physicians, and others meet to 
present and discuss clinical cases. The term includes resident reports,
morning or housestaff conferences, and morning sessions but
excludes work rounds or teaching rounds. In a typical morning
report, the team on duty during the night presents recently admit- 
ted patients, followed by a general discussion of the cases and re- 
lated topics. 

Data Collection 

Data Identification and Study Selection. Four complementary approaches
were used to locate articles about morning report. The
goal was to retrieve all published articles. First, Medline, ERIC, and 
PsycINFO were searched using the key words morning report,
morning session, residents’ report, morning conference, education, 
and teaching. The key words were used in various combinations 
and in different search modes (e.g., titles and subject headings). 
The search covered articles written between 1966 (start of Med- 
line) and December 1999. No limitation was set on the search 
parameters. All journals, languages, and types of articles, including 
original articles, surveys, opinions, and letters to the editor, were 
included. Second, a manual search was conducted through non- 
indexed medical education journals. All relevant articles not pre- 
viously identified by computerized searches were included. Third, 
the reference section of each article was reviewed and all pertinent
articles not previously found were also retrieved and included for 
review. Finally, knowledgeable educators in the field were consulted 
in an effort to locate any additional articles not previously detected. 
As a result, 48 articles were found related to morning report. Al- 
though the search began with articles dating back to 1966, the 
oldest article on morning report was published in 1979. Most articles
(80%) were published after 1990. Forty-one articles are discussed;
seven other articles, mostly letters to the editor, addressed
issues already covered elsewhere.

Data Extraction. The selected articles were reviewed according
to a three-step method as described by Gordon, namely identification
of key issues for review, selection of relevant information
from various articles related to each issue, and critical synthesis and
generalizations. The focus was primarily on the educational aspects
of morning report and areas of possible improvement. We identified
four major areas for review: purpose of morning report, organization,
instructional methods, and educational outcomes. Each topic
area is presented, followed by an overall discussion at the end. 

Purpose of Morning Report 

Historically, morning report probably was created to meet the demands
of the hierarchical systems of public hospitals. In many cases,
there were no ward attendings, and the chief of service had to 
ensure the health and safety of all the patients. Morning report 
provided the chief of service with the information needed to 
achieve this level of oversight. Both the purpose and the audience
of morning report have evolved over the years, and morning report 
is now conducted for diverse purposes with a wide variety of audiences.
The various purposes were evident in the literature reviewed,
with education becoming the main objective. Other purposes
were also mentioned, such as evaluating residents and the
quality of services, detecting adverse events, and social interaction. 
The multiple purposes were evident in Parrino and Villanueva’s
survey of faculty and chief residents from 124 departments of med- 
icine. Half of the respondents considered morning report “an im- 
portant case-oriented teaching session” and a fifth believed that 
morning report “allow[ed] the chief of medicine or program director
to keep tabs on medical services.” The importance of education
was also reiterated in a recent survey where the majority of internal 
medicine residents indicated that education should be the primary
purpose of morning report. The various purposes of morning report
are presented according to five subheadings: education, evaluation
of residents and quality of services, detection and reporting
of adverse events, non-medical issues, and social interaction. 

Education. The educational goals pursued during morning report 
varied widely, ranging from case-based teaching to reviewing
and planning patient management, fostering presentation
skills, highlighting the unique approach of the generalist physician,
developing intellectual curiosity and research, and promoting
decision-making skills and self-directed learning. Morning
report was also used to teach residents selected topics that are not
usually part of the curriculum, such as ethics. Case-oriented teaching
was the most frequently cited educational purpose of morning
report.

Evaluation of Residents and Quality of Services. Most of the programs
surveyed used morning report as a means of evaluating residents’
performances. In Parrino and Villanueva’s survey, faculty
in many programs used morning report to evaluate residents’ attitudes
(84%), clinical skills (63%), and quality of care (93%). A
majority of respondents (82%) reported that morning report was
also an effective means of case management. Although morning
report was used to evaluate residents and quality of care, no structured
instrument or rating scale to conduct such evaluations was
reported. 

Detection and Reporting of Adverse Events. Morning report was
sometimes used to detect and report adverse events. Kaufmann
reported that a pharmacy intern regularly attended morning report 
and considered whether admissions were related to medication 
problems. Sivaram et al. reported that adverse drug reactions were
discussed in the business portion of morning report and were later 
reviewed by the Pharmacy and Therapeutic Committee. Welsh et
al. explored the effect of prompting residents to report adverse 
events. All three studies concluded that morning report can be
an effective means to detect and report adverse events such as drug 
reactions. 

Non-medical Issues. Although the discussion of non-medical issues
during morning report was seldom reported, most programs
addressed these issues on a regular basis. Schiffman et al. found that
85% of programs addressed a variety of non-medical issues such as
social, personal, ethical, political, and economic topics, as well as
cost-effectiveness and administrative matters. Actual time spent
on these issues during morning report was not reported. 

Social Interaction. Although social interaction was not an explicitly
stated goal, morning report provided an opportunity for residents
and faculty to socialize. Eighty-five percent of the respondents
in Parrino and Villanueva’s survey indicated that morning report
was an important social event for both residents and faculty. Two
thirds of the programs in Schiffman et al.’s study served food and drinks
during morning report and conducted business in an informal atmosphere
that fostered social interaction.

In summary, residency programs used morning report for multiple
purposes, including education and a variety of other goals. Resi- 
dents favor morning report as an educational activity. The relative 
importance of each purpose of morning report depends on individ- 
ual programs and, in turn, may determine the way morning report 
will be organized and conducted. 

Organization of Morning Report 

Most of the articles that addressed the organizational aspects of 
morning report came from internal medicine residency programs. 
Other programs included pediatrics, family medicine, and neurology.
The organization of morning report is presented according to
five subheadings: frequency, time, and duration; participation, lead- 
ership, and tone; case selection and presentation; record keeping; 
and patient follow up. 

Frequency, Time, and Duration. The frequency of morning report
was fairly uniform across programs. Most were held on a regularly 
scheduled basis, with 80% of internal medicine programs holding 
morning report five times or more a week. Only a handful of programs
held morning report fewer than three times a week. Morning
report usually began before 9 AM and lasted for an hour. Some
programs (4%) actually held “morning” report during the afternoon.
In most programs, work rounds preceded morning report to
facilitate data collection prior to morning report. Schiffman et al. 
argued that conducting morning report after ward rounds may be 
more useful because attending physicians can contribute significantly
to the quality of the session.

Participants, Leadership, and Tone. The mix of participants and 
leaders varied greatly across programs. The chief of medicine or the 
director of medical education was present in more than half of the 
sessions. Third-year service residents were the most regular participants,
while the presence of first-year residents varied, with
about 60% of the programs requiring their participation on a regular
basis. Gross et al. reported that internal medicine residents
prefer the presence of generalist physicians at morning report, possibly
because of the renewed interest in general internal medicine.
Carruthers described an Australian program where general practi- 
tioners from the community regularly attended morning report. She 
argued that a more widespread participation of general practitioners 
during morning report would lead to a better understanding of the 
strengths and weaknesses of general practice. Finally, the presence
of non-physician participants helped to broaden the scope of 
knowledge and experience of the residents. For example, pharmacists
increased the detection of adverse drug reactions and librarians
increased the use of online searches by residents. Some
have argued against the presence of non-service personnel, junior
residents, or medical students at morning report because their presence
might inhibit the spontaneity of case presentation and discussion.

Studies of verbal interactions during morning report consistently
showed that participants tend to be rigid in their roles and in their
ways of asking for or providing information. Most of the informa- 
tion exchanged was low-level factual information. Few questions 
were asked that required synthesis of patient information and medical
knowledge.

The person leading morning report was either a faculty member 
(70%) or a chief resident (30%). Many openly criticized the role
of the leaders and the tone they set during morning report.
Comments such as “morning retort or morning distort,” “where
bottom line is style above substance,” and “secretive closed-door
session” were reported frequently. McGaghie et al. described the
menacing atmosphere that prevailed in one institution as “. . .
housestaff defining and defending mishaps using mechanisms such
as denials, discounting, and distancing.”

Case Selection and Presentation. The selection and mode of presentation
of cases also varied greatly among programs, reflecting
most often the chief resident’s or attending physician’s preferences.
Case presentations varied from brief presentations of all
cases with equal emphasis on each case to elaborate presentations
of one or two “interesting” cases. Accordingly, times allotted for
each case presentation varied widely. Weseman prospectively compared
the nature of the cases presented in internal medicine at a
university center with those at an affiliated Veterans Administration
hospital. The case mixes were similar in the two institutions;
most cases (88%) were those of inpatients. Gerard et al. reported
that pediatrics residents were more likely to select cases whose diagnosis
changed during hospitalization. Other unorthodox methods
of case selection and presentation included the selection of
cases one to two days in advance, the selection of simple cases at
the beginning of the academic year and more complex ones later
in the year, and the presentation of cases prior to discharge.

Record Keeping. Record keeping was done for different purposes 
during morning report. Records were kept for educational
purposes, such as the evaluation of content coverage and
patient follow ups, or as data sources for research. The availability
of computers enabled many programs to use the data from
morning report for a variety of purposes. Rouan et al. described a
computer program to generate information from hospital admissions.
They used the information for patient follow up, patient distribution
among housestaff, residents’ evaluation, and quality assurance.
Recht et al. also described a computerized data
management program and its use in clinical research and quality
assurance.

Patient Follow Up. Most internal medicine programs allowed for 
patient follow ups. Wegner and Shpiner showed that a final diagnosis
was not always available at the time of discharge. Similarly,
Barton et al. compared pediatrics morning reports from a community
hospital and a university hospital. In both settings,
significant numbers of patients, 28% and 58%, respectively, were
not diagnosed at the time of presentation at morning report. Both
investigators concluded that provision of patient follow up in
morning report was important to maximize education. 

In summary, there was a fair amount of regularity and similarity 
among programs in the frequency, time, and duration of morning 
report. There was more variability in the mix of participants and 
leaders, case selection, record keeping, and patient follow up. Many
openly criticized the type of leadership used in conducting morning
report. There was a lack of evidence in the literature on how the
different purposes of morning report might affect its organization
and the educational and clinical outcomes. 
Instructional Methods

The most frequent instructional method used during morning re- 
port was case-based presentation, followed by discussion. Over three 
fourths of the programs surveyed by Malone and Jackson used such 
an approach. Variations of case-based presentations were also used
in an effort to improve educational effectiveness. For example, the 
chairman and chief resident would meet prior to morning report 
to review cases and preselect critical points for discussion. The
limitations of case-based presentations were also discussed in the
literature, most notably by Parrino and Villanueva, Mehler et al.,
and Hill et al. Mehler et al. argued that “the standard format of
case presentation may be less than optimal and can become a hackneyed
experience.” Some shortcomings of case-based presentations
have been addressed through innovative methods such as the
presentation of prepared topics, photographic materials, and
learner-centered learning approaches. In learner-centered approaches,
the residents would determine the goals of the session
once the cases were presented and then formulate questions based
on these goals. Parrino and Villanueva further proposed that “new
techniques at morning report could be based on existing models of
problem-based learning.” Battinelli echoed this view and advised
learners to be creative and try new approaches.

Like medical education, morning report faces a dilemma over its 
educational focus. Two main orientations emerged from the review. 
One focused on the need to increase the residents’ knowledge level, 
the other on the need to improve their problem-solving and data-gathering
skills. DeGroot and Siegler described the dilemma by
using the analogy of the retentive “sponge mode” versus the inquisitive
“search mode.” Years later, Richardson and Smith revisited
this issue and reemphasized the importance of learning the
process of information gathering and analysis rather than simply
acquiring content knowledge. Reilly and Lemon described a four-phase
morning report (similar to evidence-based medicine) to foster
active learning. The first phase was devoted to the discussion of
assigned questions from the previous day. Next, residents briefly 
presented all admission cases and the chief resident used didactic 
methods to emphasize important teaching issues. The participants
then discussed in detail one particular case chosen for its educa- 
tional value. Finally, the last five minutes were spent on formulat- 
ing questions and assigning them to residents for presentation the 
next day. Reilly and Lemon reported a department-wide, positive
impact following the introduction of this format. In addition, res- 
idents learned the principles and procedures of evidence-based 
medicine and how to formulate precise and clinically relevant ques- 
tions. 

Educational Outcomes 

In an era of evidence-based medicine, evidence is also needed in 
education to enlighten existing educational practices and to plan 
new ones. Just over half of the 48 articles on morning report (52%) were
based on studies. Surveys and questionnaires were used most often 
to collect data (nine studies); other data-gathering methods were 
observations, video recordings, quizzes, logbooks, and hospital 
records. Most studies were based on single programs; only four were 
conducted with multiple programs. Some articles were based
on anecdotal reports without any detailed data presented. 

Wartman stated that detailed discussions, chart reviews, and
analysis of hospital bills of selected discharged patients resulted in 
subsequent reductions in lengths of stay and controllable costs.
Similarly, Mehler et al. described a model of morning report that 
resulted in less test ordering and fewer requests for consults. They
reported that the participants' level of enthusiasm declined during 
the academic year and that more in-depth discussion of single cases 
became more attractive as time went on. Bassiri et al. introduced
changes in morning report (such as presentation of articles, comments
by specialists, a computer database, and regular followups)
that improved the level of discussion and generated data for research.
Potyk et al. reported that both quizzes and mini-lectures
increased learning, as measured by a true-false test administered
later, although the quiz format resulted in better information retention.
D’Allessandro and D’Allessandro reported the use of radiology
slides at pediatrics morning report as a means of increasing
residents’ interest. Finally, several authors reported that morning
report covered a broad range of topics included in published curricula
(e.g., Pediatrics Review and Education Program by the American
Academy of Pediatrics) and in major medical references (e.g.,
internal medicine textbooks). All programs that implemented innovations
reported positive results as measured by increases in residents’
knowledge or desired behaviors.

Discussion 

Some key findings emerged from the diverse, albeit limited, number
of publications on morning report (48 articles over 20 years). First,
the purposes of morning report varied widely, although education 
was most frequently cited and favored by residents. Other important
purposes were also mentioned, such as patient management and 
program and resident evaluation. Second, certain characteristics of 
the organization of morning report, such as frequency, timing, and
duration, were fairly similar across programs. On the other hand, 
mix of participants, case selection and presentation, leadership, rec- 
ord keeping, and patient followup varied widely across programs. 
Tone, leadership, and the learning environment were often criticized.
Third, various interventions that were implemented to improve
the educational and clinical outcomes of morning report generally
resulted in positive and promising results, although further
validation of these findings is needed. Fourth, most of the published 
studies were from single programs, especially in internal medicine. 
There were very few studies on medical students and morning re- 
port. Encouragingly, there is renewed interest in morning report as 
an educational activity, as evidenced by the steady growth of published
articles during the past decade.

The limited evidence available on morning report makes it dif- 
ficult to make grounded recommendations, but some of the models 
used to plan and implement morning report were based on sound 
educational principles. For example, Reilly and Lemon’s model of
morning report is unique in that it encourages active learning, 
maintains continuity, and improves research activities in the program.
Such theory-based models can serve as the foundation on
which to develop sound educational interventions that can be submitted
to the scrutiny of educational researchers. There is a
clear lack of studies to document the effectiveness of morning re- 
port. This paucity may be due to the difficulties of doing research 
in the context of a multifaceted and multifactorial situation such 
as the multiple purposes, organizations, and audiences involved in 
morning report. It is also difficult to isolate the effects of morning 
report from those of other formal and informal educational activ- 
ities. Finally, the lack of validated assessment instruments also adds 
to the difficulty of doing research on morning report. These difficulties
should not be seen as insurmountable obstacles but as challenges
to be met.

Future research is needed in four key areas. First, there is a need
to characterize the types of learning and teaching that go on during 
morning report. What are the unique teaching and learning characteristics
of morning report compared with other educational activities
such as work rounds or teaching rounds? Second, little is
known about the satisfaction levels of participants and the motivational
factors that are operative during morning report. Although
residents value morning report as their most important day-to-day 
learning activity, they also harbor strong negative feelings about 
the atmosphere that prevails. Could the quality of morning report 
be enhanced by analyzing more closely the positive and negative
feelings of the residents and the faculty? Research is also needed to
document the effects of morning report on residents’ knowledge, 
behaviors, and attitudes, as well as on patients’ health care out- 
comes. Finally, there is a need for multi-institutional research on 
the effectiveness of new strategies to conduct morning report in 
order to verify the robustness of the interventions and thus move
beyond program-specific effects.

Although the main focus of morning report has been on inpa- 
tient topics, there is a need to address the specifics of morning 
report in the context of ambulatory care. The pioneering work by
Malone and Jackson indicated that the educational characteristics 
of ambulatory morning reports are significantly different from those 
of inpatient morning reports. Consequently, simple generalization
of results from inpatient modalities to ambulatory care is not recommended.
Ambulatory morning report is relatively new and offers
ample opportunities for high-quality research, including the iden- 
tification of the specific learning needs of the participants. What 
are the unique components of the residents’ education that should 
and can be addressed during ambulatory’ morning report? What are 
the unique educational attributes of ambulatory- morning report? 
How can the continuity between ambulatory morning report and 
inpatient morning report best be ensured? Other priority research 
areas include studies of the natures of the cases presented and their 
relationships to educational and clinical outcomes. 

The majority of studies on morning report came from internal
medicine programs, with only a handful of reports from pediatrics,
family medicine, and surgery. There is a need to plan studies across
specialties to inform one another about the effectiveness of the
innovations. Although morning report is primarily focused on
residents, there are other important participants present during
morning report, such as medical students, ethicists, and pharmacists.
There was little focus in the literature on the participation of these
types of participants during morning report. The educational needs
and learning characteristics of this diverse audience are different
from those of residents and need to be studied as well.

Morning report is a time-honored tradition. It is not just a ritual
of early morning social gathering or a one-stop opportunity for
program directors to keep tabs on the program. It is a valued time for
residents, an uninterrupted flow of priceless minutes set aside from
the hectic morning schedule for learning. Morning report is an
opportunity for residents to exercise and improve their knowledge
and their leadership, presentation, and problem-solving skills. Yet
reports of its educational effectiveness are mostly anecdotal and its
purpose often implicit or not explicitly defined. Each individual
program must decide what it wants to achieve with morning report
and structure the activity accordingly, distinguishing it from sitting
rounds or patient-management rounds. Research is needed to
document the educational and clinical effectiveness of morning report
and to assess the relative merits of various ways of conducting
morning report such that evidence and tradition can go hand in
hand.

This review was done as part of an Independent Study while Drs. Amin,
Guajardo, and Wisniewski completed master’s degrees in health
professions education in the Department of Medical Education at the
University of Illinois at Chicago and were fellows in the International
Educational Partnership in Pediatrics program jointly administered by
the Department of Pediatrics and the Department of Medical Education.

During medical school, students are taught the knowledge, skills,
and attitudes required to become competent physicians. Knowledge
and skills are rigorously evaluated by written and oral exams,
standardized patient scenarios, and ward evaluations. However,
evaluation of behaviors, including professionalism, is often implicit,
unsystematic and, therefore, inadequate. This is problematic for
several reasons. First, medical schools are doing a disservice to
future postgraduate training programs, as well as to society, by not
explicitly and accurately evaluating this area during medical school.
It is recognized that more complaints against physicians to medical
societies relate to unprofessional conduct than to lack of knowledge
or poor technical skills. Yet students who display unprofessional
behavior may not be identified in the current system, and will be
promoted academically on the basis of adequate performance on
tests of knowledge and skills alone.

Second, we are doing a disservice to our students by not providing
explicit feedback in this domain, thereby missing valuable
opportunities to bring about awareness and improvement. The
American Board of Internal Medicine, in its report “Project
Professionalism,” discussed the problem of erosion of professionalism
during medical training. While knowledge and skills improve markedly
over the four years of medical school, there is ample anecdotal
evidence, and substantial quantitative evidence, that professional
behaviors can diminish over this period. There appears to be an
unrealistic expectation that students will arrive at medical school
lacking in knowledge and skills, but with a full complement of
appropriate behaviors that require no further attention. However,
all students are vulnerable to lapses in professional behavior and
can benefit from explicit, systematic attention in this domain. The
focus of medical education in the past century was on knowledge
and skills. For the future of medicine, attention to the teaching
and evaluation of professionalism is vital.

While this need to evaluate professionalism effectively has been
recognized for some time, traditional methods of addressing the
problem have not been particularly successful, for several reasons.
The traditional approach to this issue has involved the
identification and definition of the attitudes and concepts that comprise
the concept of professionalism (such as altruism, accountability,
excellence, duty, honor, integrity, and respect). Evaluation methods that
rely on such abstract and idealized definitions lead us to discuss
people, rather than their behaviors, as being honest or dishonest,
professional or unprofessional. This implies that professionalism
represents a set of stable traits.

Interestingly, a large literature exists that suggests the opposite.
Many studies in personality psychology have shown that the presence
of specific personality traits does not predict behavior. For
example, in one study of psychiatry residents, Minnesota Multiphasic
Personality Inventory testing revealed serious personality disorders
in the two individuals who eventually lost their licenses for
professional misconduct. However, several other participants
showed the same personality traits, yet had no difficulty reported
in 15 years of follow up. Thus, evidence suggests that the
identification of specific traits does not allow us to predict an
individual’s behavior.

There are several reasons why this issue is important when
discussing the evaluation of professionalism. Stable trait measures do
not take into account a recognition that behaviors enacted often
involve an effort at resolving a conflict between two (or more)
equally worthy professional or personal values. For example, it is
easy to say that one must always tell the truth, and that one must
always protect patient confidentiality. However, these values may
occasionally come into conflict, and the ultimate choice the student
makes will depend on the specifics of the situation.

In addition, professional behaviors are known to be highly
context-dependent. One can imagine a basically honest person lying
to a patient given a particular context. This does not automatically
mean that that person is dishonest, and therefore
unprofessional. Certainly in social situations, a decision to always
tell the full truth would be considered highly inappropriate.

Although the issues of conflict and context are separate at a
theoretical level, in day-to-day practice they are likely to interact.
One study has shown that 87% of physicians surveyed indicated
that deception is acceptable on rare occasions, for example, if the
patient would be harmed by knowing the truth, in order to circumvent
“ridiculous rules,” or to protect confidentiality. Yet, when
two specific professional values are in conflict, it is not always
predictable which of the two values will take precedence. For example,
while it is sometimes appropriate to lie in order to protect patient
confidentiality, there are circumstances in which it would be
considered more appropriate to break confidentiality rather than tell a
lie. As one participant stated, honesty is “usually” the best policy,
but everything is taken on a case-by-case basis, and any actions
taken depend on the specifics of the people and the situation.
Traditional ways of evaluating professionalism do not make allowances
for these gray areas.

Another element of evaluating professionalism involves the process
of resolving the conflict. The ultimate choice an individual
makes, manifested as the behavior witnessed, does not tell us how
he or she arrived at the decision. We know nothing of whether the
student recognized the professional “values” that were in conflict,
or why the student chose to act in that particular way. So while
focusing on behaviors rather than personality or character traits is
important, we must also attempt to understand the process that led
to the behavior.

Thus, if we do not include conflict, context, and the process of
resolution in our evaluation methods, we might not be able to
conduct the most reliable, valid, and appropriate evaluation of
these behaviors.

Another reason for the lack of success of traditional approaches
is that evaluators have not been willing to identify an individual
as unprofessional for actions that appear to be relatively minor.
Thus, lapses in professional behavior tend to be ignored or
suppressed, due to an understandable reluctance to apply the broad,
harsh label of “unprofessional.” In one study, clinician supervisors
admitted and demonstrated their reluctance to give negative feedback
regarding unprofessional behavior, even though in interviews
they had stated strongly that they would do so. Even if faculty
have this willingness, they have been found to have “difficulty in
identifying problems, an inability to verify problems, and fear of
litigation” that inhibit their reporting of behavioral problems.

This outcome arises, in part, from the fact that educators and
researchers have traditionally focused on this problem from an
abstract perspective. The definitions and subcategories of the broader
concept of professionalism describe the idealized person, the
“consummate professional,” with no room for mistakes. With this
theoretical basis, if someone tells a lie, even for a “good” reason, he
or she could be suddenly labeled “dishonest,” and therefore,
“unprofessional.” The only thing left for the evaluator to decide, then,
is how unprofessional the individual is. This top-down focus on
professionalism as an abstraction rather than a bottom-up focus on
professionalism as a set of actions in context, therefore, is flawed.

This paper elaborates on the issues around this problem. First,
we review the literature on the types of evaluation instruments used
for measuring professionalism in medical education. We then outline
fundamental conceptual deficiencies that exist in this literature.
We argue that the three most important missing components
are: consideration of the contexts in which unprofessional behaviors
occur, the conflicts that lead to these lapses, and the reasons
students make the choices they make. We then propose strategies
for resolving these issues.

Method 

We conducted searches through Medline, Psychlit, and ERIC for
literature published over the past 20 years. We included studies that
contained original research on the topic of assessment or evaluation
of professionalism in medical education, or included instruments to
measure professional behavior, professionalism, humanism, behaviors,
values, and attitudes. After initial articles were identified,
bibliographies were used to identify additional references, and experts
in the field were consulted for missing but relevant papers. This
process uncovered few studies addressing specific efforts to evaluate
professionalism. There was an abundance of articles calling for new
and better methods of evaluation, and arguments for why this is so
important and neglected. Some papers dealt with certain aspects of
professionalism, for example, ethics, communication skills,
interpersonal skills, and humanistic behavior, but they did so without
extrapolation to the larger notion of professionalism. These studies
were included if they highlighted difficulties in evaluating
professionalism or provided new insights or solutions, and contained
original research.

Results 

Evaluations by Faculty Supervisors. In 1979, the AAMC interviewed
approximately 500 clerkship directors about “problem students.”
They identified 21 types of problem students, and then
asked how often each type of problem was seen, and how difficult
the problem was. Among the results from the University of Washington
School of Medicine, researchers found that “noncognitive”
issues (e.g., bright but poor interpersonal skills) were “frequent and
difficult,” but that the very disturbing ones (e.g., cannot be trusted,
manipulative) were seen only infrequently. Though this survey
was done many years ago, it provides an early glimpse of faculty’s
concerns about the professional behaviors of students. Since then,
various other studies have analyzed approaches used by faculty in
the evaluation of professionalism, including global rating scales,
in-training evaluations, and encounter cards.

Ward rating forms, completed by the physician-supervisor, are
the most commonly used instruments. In addition to assessing medical
knowledge and clinical skills, many of these forms have a single
global item to assess professional behavior, which may be subject
to extensive rater bias. A study by Woolliscroft et al. highlights
some of the problems of using this type of assessment. The authors
found that using a questionnaire, faculty could assess the humanistic
qualities of internal medicine residents, at least for the item
“doctor-patient relationships.” However, it would take 20-50 faculty
members per resident to achieve acceptable reproducibility,
which calls into question the utility of this instrument. This also
suggests that the trait doctor-patient relationships is probably not
stable, but rather may be subject to context bias. Different
evaluators might see different behaviors or make different
interpretations. In a related study, Johnson found that physicians’ and nurses’
evaluations of intensive care unit residents correlated highly with
respect to all criteria except the assessment of humanistic qualities,
further highlighting the importance of context.
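
The relationship between the reliability of a single rater and the
number of raters needed for a reproducible pooled score can be
illustrated with the Spearman-Brown prophecy formula. The sketch
below is illustrative only: the studies cited above do not say in this
text which statistical method produced their estimates, and the
reliability values and the function names `pooled_reliability` and
`raters_needed` are hypothetical.

```python
import math


def pooled_reliability(single_rater_reliability: float, n_raters: int) -> float:
    """Spearman-Brown estimate of the reliability of the mean of n_raters ratings."""
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


def raters_needed(single_rater_reliability: float, target_reliability: float) -> int:
    """Smallest number of raters whose pooled reliability reaches the target.

    Obtained by solving the Spearman-Brown formula for the number of raters.
    Both arguments are assumed (hypothetical) values strictly between 0 and 1.
    """
    r, t = single_rater_reliability, target_reliability
    if not (0 < r < 1 and 0 < t < 1):
        raise ValueError("reliabilities must lie strictly between 0 and 1")
    exact = (t * (1 - r)) / (r * (1 - t))
    # Subtract a tiny tolerance so floating-point noise does not add a rater.
    return math.ceil(exact - 1e-9)


if __name__ == "__main__":
    # Hypothetical inputs: a weak single-rater reliability and a common target.
    print(raters_needed(0.10, 0.80))
```

Under these assumed values (single-rater reliability of 0.10, target
of 0.80), the formula calls for 36 raters, which is of the same order
as the 20-50 faculty members per resident mentioned above.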

To compensate for the problem of infrequent observations, systems
have been developed that encourage the repeated observation
and documentation of the performances of medical trainees (often
on a daily or weekly basis). This allows for the assessment of
knowledge, skills, or professional behaviors with reasonable
inter-rater reliability and construct validity. Such real-time evaluations
permit early intervention, facilitate feedback, and guide
remediation. However, in a study of encounter cards in the evaluation of
anesthesia residents, despite numerous negative comments by
supervisors, only 1% of the comments were found to be about
unprofessional behaviors. Further, those residents who received these
negative comments were only rarely rated overall as “performing
below level” by their supervisors, despite their all having had
critical incident reports and scoring lower on objective testing. This,
again, highlights the difficulties faculty have in documenting
unprofessional behavior.

Faculty can, in fact, be trained to accurately observe and assess
specific behaviors. One group developed a reliable assessment of a
very specific set of humanistic skills (e.g., introduced self to the
patient, acknowledged the agenda from the last visit) by asking
faculty to view videotapes of residents’ interactions with patients.
However, even if faculty can identify problematic behavior in a
reliable way, they are often reluctant to record it. Burack, using
a rigorous qualitative method, demonstrated that faculty have a
marked reluctance to respond unambiguously to behaviors that
indicate negative attitudes towards patients. In interviews, faculty
stated that they would not tolerate “this sort of behavior” and
would “definitely lay down the law” if such behavior were observed.
However, in practice they usually did not respond at all, or did so
in such a way as to require interpretation by the learner. The
feedback can then be misinterpreted to be permissive. As explanations
for this dichotomy, clinicians reported their sympathy for the
learners’ stress, as well as the possible penalties educators can face for
giving negative feedback, such as receiving bad teaching
evaluations and being open to personal and legal risks. They felt that if
the observed behavior is only a lapse, and the learner is
fundamentally “good,” corrective feedback might discourage or frustrate
the resident. Conversely, for fundamentally “bad” residents,
corrective feedback is seen as futile.

Therefore, methods that exist for faculty evaluation of professional
behavior are problematic. Evaluations cannot be kept on
theoretical, abstract, or definitional levels; thus, these scales have
poor reliability. Numerous observations in various contexts need to
be made, but attending physicians are present for only a small
proportion of the time. In addition, even when lapses in professional
behavior are identified, there is great reluctance to report them.

Nurses and Patients. Some of the reluctance faculty have in
evaluating professional behavior results from potential conflict in their
roles as teacher, mentor, and evaluator. Other groups, such as
patients or nurses, may not be subject to these conflicts. In
addition, these other groups may see the students and residents
more often and in different contexts. Woolliscroft’s study included
groups of nurses and patients; unfortunately, the patients’ ratings
were not reliable, and it would have required up to 50 patients’
assessments to achieve a reproducible estimate of professional
behavior. Nurses achieved good reproducibility with ten to 20
assessments per resident, but this amount may still be impractical.
Because professional behavior is so context-specific, it is not
surprising that only low to modest correlations exist between ratings
by these different assessors. Also, nurses and patients may face
different kinds of pressures that could deter their unbiased reporting
of unprofessional behaviors; for example, a patient may be reluctant
to jeopardize the continuity of a relationship with a physician even
though it is problematic. In addition to highlighting some of the
difficulties in evaluating professional behavior, Woolliscroft et al.’s
study provides a good example of an attempt to triangulate results
as a measure of validity.

Peer Evaluation. Peers are in a good position to evaluate each
other’s professional behaviors because of frequent, close, and varied
contact. Thus, the use of peer assessment of professional behaviors
may solve many of the problems described for faculty’s assessment.
However, several problems remain and some new problems may
arise through the use of peer assessment.

On a positive note, there is some suggestion that medical
students’ peer evaluations may be the best measures of interpersonal
skills available. Thomas et al. reported a pilot study of peer
review in residency training using a ten-item questionnaire. The
items on the form clustered into two domains: “technical skills”
and “interpersonal skills,” which included humanistic behaviors. Of
particular interest is this study’s finding that intern peer evaluations
of a composite “professionalism” domain correlated well with
faculty evaluations of the same dimension (r = .5?, p < .05). An
interesting modification of a ranking system that forces students to
discriminate among their peers based on certain dimensions of
professionalism has been described. The authors suggest that such a
system enables identification of the top 10-15% of the class, but
it is not helpful in discriminating among the rest, perhaps because
the students were asked for only positive nominations on the
peer-evaluation form.

On the other hand, peers, like faculty, seem to have a difficult
time discriminating the abstract dimensions of professionalism from
each other and from other skills. For example, in a study of peer
assessment of professional dimensions, Arnold found very high
internal consistency (coefficient alpha) across the dimensions,
suggesting a strong halo effect in the ratings of the separate
dimensions. Further, scores were highly correlated with more
knowledge-based measures such as National Board of Medical
Examiners’ exam (Parts I and II) and grade-point average, suggesting
that dimensions other than professionalism were also contributing
to the scores. Also, as with faculty ratings, it would appear that a
fairly large number of ratings are necessary to obtain stable measures
across raters. Interestingly, the numbers of negative peer
evaluations generated in the small groups depended upon the kind of
faculty leadership exercised in each group. This constitutes yet
another example of the importance of context and social climate
in peer (and other) assessment methods.
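
Coefficient alpha, mentioned above as evidence of a halo effect, can be
computed directly from a table of ratings. The sketch below uses the
standard formula for Cronbach's alpha; the function name
`cronbach_alpha` and the ratings are hypothetical and are not taken
from any of the studies reviewed here.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's coefficient alpha for a table of ratings.

    ratings: list of rows, one row per person rated; each row holds one
    score per dimension (all rows the same length).
    """
    n_dimensions = len(ratings[0])
    if n_dimensions < 2 or len(ratings) < 2:
        raise ValueError("need at least two dimensions and two rated persons")
    # Sample variance of each dimension across the persons rated.
    dimension_variances = [
        variance([row[j] for row in ratings]) for j in range(n_dimensions)
    ]
    # Sample variance of each person's total score.
    total_variance = variance([sum(row) for row in ratings])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return (n_dimensions / (n_dimensions - 1)) * (
        1 - sum(dimension_variances) / total_variance
    )


if __name__ == "__main__":
    # Hypothetical peer ratings on three dimensions; each rater gives nearly
    # the same score on every dimension, as a halo effect would produce.
    halo_like = [[5, 5, 4], [2, 2, 2], [4, 4, 5], [1, 2, 1]]
    print(cronbach_alpha(halo_like))
```

When raters give nearly the same score on every dimension, alpha
approaches 1, which is why a very high alpha across nominally separate
dimensions is read as a sign that they are not being judged
independently.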

In fact, the social climate of peers assessing peers may have
negative consequences. That is, while some studies report positive
reception of peer feedback, others report marked resistance to peer
evaluation even though the evaluations were anonymous and for
research purposes only. Helfer found that senior medical students
were more accepting of peer evaluations than were junior
students, who lacked confidence in the usefulness of the system.
Van Rosendaal found that residents worried that the process would
undermine their work and personal interrelationships.

In summary, peer evaluations hold promise for evaluating
professionalism. However, before they are likely to be very useful, many
of the same problems facing faculty’s evaluation of professionalism
will have to be solved, and evaluation systems must be developed
that will overcome the reluctance of peers to rate one another.

Self Evaluation. Several early studies were conducted that
involved self-reports of attitude changes during medical training. To
varying degrees, these students reported increases in certain
attitudes, such as cynicism; were more concerned about making money;
or felt that their ethical principles had become eroded or lost.
Some positive attitudes increased as well, for example, concern for
patients, and helpfulness. More recently, Clack studied gender
differences in medical graduates’ self-assessments of personal attributes
and found that women generally felt more confident than men in
possessing nine of the 16 “ideal” attributes listed. These studies
indicate that our understanding of students’ attitudes, some of
which may reflect aspects of “professionalism,” can benefit from
self-report questionnaires. However, these studies are comparing
groups and trends, not assessing the qualities of individuals. The
utility of self-reporting for these purposes might be much more
severely limited.

Most studies of self-assessment in medicine focus on the
assessment of knowledge and skills rather than on professional behavior,
but they generally conclude that self-assessment is quite
inaccurate. If physicians are inaccurate at self-assessment in relatively
concrete domains (e.g., knowledge), they are likely to have even
greater difficulty in a domain such as professionalism, which is less
well defined and more socially value-laden. A recent line of
research, for example, introduced a model of self-assessment described
as the relative ranking technique, in which each participant ranks
a set of skills relative to each other from the skill that needs the
most work to the one that needs the least. Despite some success
as a self-assessment tool in the relatively constrained domain of
interviewing skills, the technique was far less useful when applied
to residents’ self-assessments of the standard components of a ward
assessment form. In this context, the authors discovered that
although residents were quite willing to say they need “the most
work” with their surgical skills, or to improve their knowledge base,
all residents responded that they needed “the least work” in
colleague and/or team relationships. It appears that when statements
are value-laden and abstract (as in issues of professionalism), the
bias of social desirability is strong, and self-assessment becomes
distorted and potentially misleading.

It is apparent that the use of self-assessment in the evaluation of
professionalism is difficult. The methods used do not take context
into account, making them somewhat threatening. Perhaps a
relative ranking system could be attempted that included only
elements of professionalism, such as interpersonal skills,
communication skills, respect, and integrity. However, it would still be unlikely
for a student to say he or she needs more work with honesty. Again,
behaviors rather than abstract definitions would need to be
incorporated to overcome this limitation. Until further research is done
to better understand the nature of self-assessment, its utility for
assessing professional behaviors is likely to be limited to formative
evaluations and the setting of personal goals.

Standardized Patients. There is an extensive body of literature on
objective structured clinical examinations (OSCEs) and
standardized patients (SPs) and their importance in the evaluation of
clinical skills. There is no literature specific to the role of either in the
evaluation of professionalism or professional behaviors within
medicine; however, there are areas in which issues of professionalism
and professional behaviors are touched on indirectly.

Using an adaptation of the American Board of Internal
Medicine’s Physician Satisfaction Questionnaire, Klamen et al. found
that SPs could reliably identify some of the professional
characteristics of the doctor-patient interaction, including using
understandable language and encouraging patients to ask questions. By
contrast, Schnabel et al. asked SPs to assess empathy, interpersonal
skills, and patient satisfaction on a 13-item checklist used in a
senior-medical-student OSCE, and found that up to 20 ratings were
needed to generate reliable measures. At the extreme, research
conducted using OSCE stations to assess students’ skills in dealing
with ethical issues concluded that 41 stations would be required to
achieve good reliability, even if the content domain were narrowed
down to one specific ethical dilemma.

At least in part, the difficulty with using OSCE scenarios is the
ambiguity with which the concepts are defined on the evaluation
form. For example, one set of forms used such anchors as “major
problems in demeanor or ethical standards resulting in inadequate
ability to deal with the patient’s problems” and “actions taken may
harm the patient.” In both instances, unacceptable behaviors
are not specified, and judgment is left up to the examiner. On a
related note, Arnold suggests that the OSCE, as it now exists, does
not discriminate between ethical analysis of a problem and
communication skills.
Another issue with SPs’ assessment is the problem of artificiality.
Norman, for example, reported on the experience with a physicians’
remediation program that uses standardized patient scenarios. SPs
in a simulated office practice, as well as in standard OSCE stations,
were asked to rate physicians’ interpersonal skills during each
encounter. Compared with the office simulations, the OSCE stations
had a low reliability and were felt to be “artificial.” This may
increase the likelihood that students in this setting might act as they
should rather than as they would. On the other hand, one study has
reported several professional lapses in the context of a psychiatry
OSCE (the most extreme case involving a student’s placing a
fleeing SP in a headlock for the purpose of restraint). Hodges et al.
argue that if stations are more demanding, they may very well
discriminate effectively in terms of professional dimensions. Similarly,
Vu et al. suggested that SPs’ ratings were highly reliable and valid
when compared with comments real patients would be expected to
make regarding the behaviors they witnessed.

Again, it is apparent that context is important. Methods of
assessment that are more true to life may be more useful than those
that involve obviously artificial situations. Students may be aware
that there is a professionalism station and respond with actions they
assume are on the checklists. It would be interesting to include
values conflicts in SP scenarios to specifically assess the students’
awareness of the professional values that are involved, and to
evaluate their responses. In such a case, there may be more than one
right answer, so the students’ thought processes about their actions
may be more important than the behaviors they actually display.
The low reliability of OSCEs, even when limited to specific
dimensions of professionalism, is concerning, and many authors have
concluded that the greatest utility of this type of assessment may
be in the formative evaluation of students.

Longitudinal Observations. More recently, researchers have
developed systems for assessing students’ professionalism that are
triggered by the observation of problematic student behaviors. The
evaluation instrument is a specific form that is completed by a
clerkship director or faculty member when a student exhibits
unprofessional behavior during a rotation. When more than one form
has been completed for a specific student, a meeting between an
academic committee and the student occurs and remediation is
instituted. These systems are based on the concept that students’
professional behaviors must be assessed longitudinally, across
numerous clinical rotations. Both studies describing this evaluation
tool have been qualitative descriptions of systems that are in place,
and further reliability and validity studies are anticipated. Such
systems are very promising, despite a lack of rigorous evaluation,
and may work well for identifying those students with significant
lapses in professional behavior. However, in their present state, they
may not prove as useful as a method of evaluating all students. The
important advance these authors have made is their
acknowledgement that labeling a student as “unprofessional” carries a greater
negative connotation than simply recording examples of
unprofessional behavior.

Discussion: Future Directions in the Evaluation of 
Professional Behavior 

It should be apparent from the preceding discussion that evaluating
professionalism in medical students and residents has proved to be
a difficult task. The definition-driven abstract way of thinking
about professionalism creates a dichotomy for faculty: either apply
a harsh label, or let the lapse go. We know from previous research
that faculty are much more likely to let the lapse go, which
effectively suppresses discussion, feedback, and attempts at
remediation.

On the other hand, evaluation methods that consider behaviors,
rather than individuals, as professional or unprofessional become
much less threatening and would be more likely to gain acceptance
by faculty and students. The studies reported by Papadakis et al.
and Phelan et al. provide two good examples of such systems.
Perhaps these methods will decrease faculty’s reluctance to report
behaviors that should lead to remediation; this can only help in
promoting students’ professional development. As developed, these
evaluation forms are intended to identify and document serious
lapses in professional behavior, which fortunately occur in only a
few students. Future research might focus on ways to make these
forms useful in the evaluation of all students. However, it is likely
that some barriers to their use would still exist; for example, faculty
would still have to decide what constitutes a major or minor
infraction. These limitations might be minimized if the behavior is
placed in a context (of the person, the situation, the harm caused
to others), a fair process of review is used, and reasonable judgment
is applied. Then, any decision made would be justifiable and well
supported. Arnold and colleagues use a hybrid of the behavioral
and abstract in their measurement tool by attaching behavioral
descriptors (such as “I have seen residents refer to patients in
derogatory terms”) to abstract dimensions of professionalism (such as
“respectfulness”), which is an interesting potential step in this
direction.

We have also argued that professional behavior is much more
context-dependent than has usually been acknowledged. All
physicians are exposed to situations that challenge their abilities to
act professionally, and medical students and residents are no different.
In fact, they may be more vulnerable to lapses in professional
behavior because of the nature of their training and environment. It
is crucial to be aware of the specific context in which a behavior
occurs before attempting to evaluate it. For example, Christakis et
al. found that the teaching students had received on ethical
dilemmas seemed to lack real-life relevance and related more to the
context of a practicing physician. Focus groups described different
dilemmas, which were unique to a third-year student’s experience.
They highlighted the conflicts between education, patient care,
wanting to be a team player, and fear of a poor evaluation. One
overriding feature was the construct of authority: students lack it
and are wary of challenging it, which often puts them into conflict.

It may be necessary to study these behaviors in context more
closely to determine their frequency and severity. Since we know
that faculty, nurses, students, and residents all see different aspects
of professionalism in students, it would be important to gain the
perspectives of each of these groups in order to be comprehensive.
One way could be to involve each of these groups in focus-group
discussions, to determine what they consider to be professional and
unprofessional behaviors. Their unique perspectives would help in
the design of instruments used in all forms of student assessment.
Another technique could be to use an anonymous encounter card
system to collect information from students, residents, faculty, and
nurses about what behaviors are actually occurring. This may
provide us with a more comprehensive set of behaviors on which to
base future evaluation methods.

Conflict has also long been identified as a critical component of
professional development, and is found as a dominant element in
some measures of professional behavior. Although such
paper-and-pencil instruments are limited by their artificial nature, some
researchers have found that professional behavior can best be
identified at the time that students are grappling with these conflicts.
One potential implication of this finding is that students could be
placed in a situation that involves a conflict of values, for example,
with a standardized patient. The behaviors the students display,
based on the choices they make, could be evaluated. What might
be even more informative is an evaluation of the thought process
a student goes through to arrive at his or her ultimate choice.

Alternatively, students could be asked to write about professional
conflicts they have encountered. The language or text from these
experiences could be subjected to linguistic or rhetorical analysis
to uncover the underlying values of individual students and explore
how these values affect the resolution of professional conflicts.
Lingard and Haber’s studies use a rhetorical framework to explore how
the structural patterns of case presentations inform medical
students’ developing attitudes towards patients and colleagues. The
authors demonstrate that a rhetorical analysis of discourse patterns
can reveal critical relationships between the stories novices learn
to tell about patients and the decisions they make about how to
act on behalf of and in relation to them. Other studies in a similar
vein reinforce the potential usefulness of this method. However,
the texts that students generate may suffer from the same
sense of artificiality that affects OSCE stations, and research in this
area would have to be designed to take this issue into account.

It is unrealistic to think that one evaluation instrument could
capture all that is important in the complex domain of
professionalism. As with all high-stakes evaluations, reliability, which
depends in part on sample size, is important. No student should
receive a grade on his or her knowledge of cardiology from a
single-item test; similarly, no student should receive a grade on
professionalism without adequate sampling of the domain. Some of
the measures outlined above have large sample sizes and are likely
to be more useful (peer evaluation, encounter cards), while others
rely on a single report or a few reports (SP scenarios, ward
evaluations). While the latter may be useful for outliers, the former are
more useful for the larger group of students who experience only
occasional lapses in professional behavior. It is certain that more
than one measurement technique would need to be used, and the
greatest validity may result from triangulating results from different
sources.

Future efforts at understanding professionalism, and future meth- 
ods of evaluating professionalism, must focus on behaviors rather 
than personality traits or vague concepts of character. Our under- 
standing and evaluation must include context and conflict in order 
to be relevant and valid. Ideally, methods of evaluation should in- 
clude elements of peer assessment and self-assessment, which are 
becoming required elements in the continuing professional devel- 
opment of all practicing physicians. Finally, we should attempt to 
understand what drives students to demonstrate occasional lapses 
in professional behavior, in order to develop effective teaching and 
remediation in this domain. 

Correspondence: Shiphra Ginsburg, MD, Mt. Sinai Hospital, 600 University Avenue,
Toronto, Ontario M5G 1X5, Canada.
Tracking Knowledge Growth across an Integrated Nutrition Curriculum

CAROL S. HODGSON 



Both the academic literature and the popular press continually
report the importance of nutrition for health. One example is the
increasing prevalence of obesity in the United States and the
concomitant popularity of diet books and untested remedies. A 1985
National Academy of Sciences report warned of the lack of
nutrition education and the need for a required curriculum for all
medical students in U.S. medical schools. Results from annual
Association of American Medical Colleges (AAMC) Graduation
Questionnaires reinforce this conclusion. In 1995, 63% of students
reported that they had received inadequate nutrition education in
their medical school curricula. In 1998, nothing had changed; 64% of
students still reported inadequate nutrition education.

Following the 1985 National Academy of Sciences report,
funding from the National Cancer Institute (NCI) stimulated
development of nutrition curricula at a number of medical schools. In
one study, the use of a multimedia program to teach nutritional
assessment and counseling was evaluated. The authors found that,
following exposure to the multimedia nutrition program, first-year
students were more likely to use a food-frequency form while
interviewing a standardized patient compared with previous students
who had not received the intervention. A majority of students
(51%) who completed the curriculum reported that observing a
physician model nutritional assessment and counseling in the
multimedia program had been helpful. In another study, the evaluation
of a two-year integrated nutrition curriculum implemented during
the basic sciences indicated increased knowledge for those students
who had completed the curriculum. Although limited in scope,
these studies are promising. They imply that, even when nutrition
is not a major aspect of the medical school curriculum, first- and
second-year students’ knowledge can increase and they may apply
their knowledge to patient care.



plemented changes in years one, two and three of the curriculum. 
New instructional and examination materials were developed to 
foster accomplishment of nutrition proficiencies outlined on the 
Web site above. The development of ongoing curricular review and
evaluation processes tracked growth of nutritional knowledge in 
those students exposed to the revised curriculum. 

Modification of targeted courses to emphasize proficiency with
nutritional concepts was the primary strategy of the curricular
change. The nutrition curriculum is concentrated in the first-year
course, Human Biochemistry and Nutrition Laboratory. A number
of nutrition-related cases are also included in two first-year PBL
courses. Nutrition is included in approximately ten lectures of the
second-year course, Pathophysiology of Disease. New curricular
material was incorporated into the required third-year family medicine
clerkship and the Doctoring 3 curriculum, where students interview
standardized patients. Numerous fourth-year nutrition electives are
offered, but their impact is limited because very few students take
these electives.

In this study, we examined the effect of changes in the nutrition
curriculum on students’ knowledge over four years of medical
school. Based on earlier findings, and further development and
implementation of nutritional content in the clinical curriculum, we
hypothesized that students completing an integrated four-year
nutrition curriculum would demonstrate, on a Nutrition Progress
Survey, a continual increase in their nutrition knowledge over time.
We also hypothesized that they would demonstrate more
confidence in their responses through a decrease in their use of “don’t
know” as a response to survey questions.

Method 



The increase of nutrition knowledge following exposure to the
clinical curriculum is potentially of greater importance than is the
nutritional content of the basic science curriculum, since clinical
exposure may be more likely to lead to application in practice.
Many studies report physicians’ lack of knowledge and confidence
in using nutritional concepts in their practices. The paucity of
physicians who model the use of nutrition concepts in their
practices could have a negative effect on students’ acquisition of
knowledge and their application of that knowledge to patient care.

At our institution, cognitive learning theory (i.e., actively
engaging students in learning) guided the development of a new
nutrition curriculum. The curriculum’s goals were to increase stu- 
dents’ (1) learning and retention of nutritional concepts; (2) skills, 
such as diet-assessment methods; and (3) application of content to 
patients’ care. To accomplish this, we planned to increase oppor- 
tunities for practice with nutritional concepts throughout the four- 
year curriculum using active learning methods such as laboratory 
exercises, a dietary self-assessment, interviews with standardized pa- 
tients, and discussions in small-group problem-based learning (PBL) 
sessions. 

In 1992, we started the curricular planning process by conducting 
a nutrition needs assessment. We received funding of an NCI R25 
grant (NCI PAR 94-005) in 1994, and a Nutrition Education 
Committee was formed to develop and implement the new nutri- 
tion curriculum. The Committee established goals and objectives 
(outlined on our Web site (http://apps.medsch.ucla.edu/nutrition/objectives.html)),
reviewed existing courses and clerkships, and im-



We used a pre-/post-test intact-group design to evaluate changes in
the nutrition knowledge of a cohort of medical students as they
progressed from their first to fourth years (class of 1998). Test items
that originally had been developed at the University of Alabama
and had been demonstrated to be valid and reliable measures for
assessing the nutritional knowledge of medical students formed the
basis of our 90-item Nutrition Knowledge Progress Survey. In order
to decrease the use of guessing, students were given an additional
response option, “don’t know,” for all questions. The students were
informed that the test items would be scored (correct = +1,
incorrect = -1, and don’t know = 0). All students completed an
informed consent form prior to entering the study.
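
The scoring rule described above can be sketched in a few lines of Python. This is only an illustration of the stated rule (correct = +1, incorrect = -1, don’t know = 0); the function names, the use of None to stand for a “don’t know” response, and the five-item answer key are assumptions made for the example and are not study materials.

```python
from collections import Counter

# Scoring rule stated in the text: correct = +1, incorrect = -1, "don't know" = 0.
POINTS = {"correct": 1, "incorrect": -1, "dont_know": 0}


def classify(response, key):
    """Label one response against its answer key.

    Assumption for this sketch: None stands for a "don't know" response.
    """
    if response is None:
        return "dont_know"
    return "correct" if response == key else "incorrect"


def score_survey(responses, answer_key):
    """Return (total score, Counter of correct / incorrect / dont_know)."""
    labels = [classify(r, k) for r, k in zip(responses, answer_key)]
    counts = Counter(labels)
    total = sum(POINTS[label] for label in labels)
    return total, counts


# Hypothetical five-item example (not study data).
key = ["a", "c", "b", "d", "a"]
answers = ["a", "c", None, "b", "a"]
total, counts = score_survey(answers, key)
print(total, dict(counts))
```

Under this rule the total score equals the number correct minus the number incorrect, so a student gains nothing by guessing at random and loses nothing by answering “don’t know.”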

The nutrition survey was administered as a pre-test to the
first-year class in January 1995. A 45-item subtest of the survey (30
items expected by first-year course chairs to be initially covered in
the first-year curriculum and 15 randomly selected items) was
administered in May 1995 (post-test 1) to the same cohort of
students. Delayed post-test exams were given to third-year students in
August 1996 (post-test 2) and to fourth-year students in August
1997 (post-test 3). Two forms of post-test 2 were administered to
third-year students: the full 90-item survey and the 45-item subtest.
Earlier we reported no significant difference between the
scores on the 45 items in common on the two forms of the test.
The 90-item exam was given at the start of the fourth year to a
randomly selected half of the cohort (n = 76). Those students who
completed all four previous surveys were asked to complete one
more at the end of the fourth year (post-test 4). Each fourth-year
Table 1. Students’ Responses to the Nutrition Knowledge Progress Survey*

                       Pre-test       Post-test 1     Post-test 2     Post-test 3     Repeated-measures ANOVA†
                       Mean (SD)      Mean (SD)       Mean (SD)       Mean (SD)       F (df)            p <
Total score             9.50 (5.40)   15.25 (7.88)‡   18.21 (5.80)§   22.96 (5.38)‖   37.03 (3, 21)
Number correct         18.04 (4.67)   26.50 (4.29)‡   28.79 (4.41)§   32.04 (3.57)‖   61.89 (3, 21)
Number “don’t know”    18.42 (5.36)    7.25 (4.24)‡    5.62 (4.82)§    3.50 (3.04)‖   64.69 (3, 21)
Number incorrect        8.54 (2.67)   11.25 (4.65)‡   10.58 (2.99)     9.08 (2.65)‖    3.70 (3, 21)     .05

* Total score (scored +1 for a correct answer, 0 for “don’t know,” and -1 for an incorrect answer), number correct, number incorrect, and number “don’t know” for the 45 items in common at each
administration, for those students who completed the first four test administrations (n = 24), compared across four test administrations using a repeated-measures ANOVA.
† Significant contrasts using the difference method are: ‡ comparing post-test 1 with pre-test; § comparing post-test 2 with post-test 1; ‖ comparing post-test 3 with post-test 2.



student who participated received a $100 gift certificate as an in- 
centive. 

Total scores were calculated by summing the scores for the items
in each exam (correct = +1, incorrect = -1, and don’t know = 0).
The 45 items in common for each test were summed to form a
total score for each administration of the survey. Additionally, the
30 items in the survey covered in the first-year curriculum and used
in subsequent years were summed to create a total score for the
first-year curriculum in order to test for learning and retention of
material. In order to test whether the total numbers of correct,
incorrect, and “don’t know” answers changed over time, total scores
for these answers were calculated by summing the number of
responses for each category. A repeated-measures analysis of variance
(ANOVA) was used to examine changes between the time points.
The difference method was used to compare each time point with
the previous one.
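
As a rough illustration of comparing each time point with the previous one, the sketch below computes, for each pair of consecutive administrations, the mean within-student change and a paired t statistic. It follows only the verbal description in the text; the exact contrast definition used by the statistical package is not given here, so this is an assumption, and the scores are hypothetical values chosen for the example.

```python
from math import sqrt
from statistics import mean, stdev


def difference_contrasts(scores_by_time):
    """Compare each administration with the one before it.

    scores_by_time: list of lists, one list of student scores per
    administration, with students in the same order each time.
    Returns a list of (mean change, paired t statistic) tuples.
    Note: if every student changes by the same amount, the standard
    deviation is zero and the t statistic is undefined.
    """
    results = []
    for earlier, later in zip(scores_by_time, scores_by_time[1:]):
        diffs = [b - a for a, b in zip(earlier, later)]
        standard_error = stdev(diffs) / sqrt(len(diffs))
        results.append((mean(diffs), mean(diffs) / standard_error))
    return results


# Hypothetical total scores for four students at three administrations.
hypothetical_scores = [
    [8, 10, 9, 11],    # pre-test
    [14, 15, 16, 17],  # post-test 1
    [18, 17, 20, 21],  # post-test 2
]
results = difference_contrasts(hypothetical_scores)
for change, t_stat in results:
    print(round(change, 2), round(t_stat, 2))
```

Each t statistic would be referred to a t distribution with n - 1 degrees of freedom, where n is the number of students who completed both administrations.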

It was possible that those students who completed the survey
every time it was given differed from those students who did not
(i.e., were more knowledgeable about nutrition). In order to test
this, a sample was randomly selected (equaling the sample size of
those who completed all four surveys) from those students who had
not completed all four exams. Pre-test, post-test 1, and post-test 2
total scores were compared for these two groups.

Results 

Approximately 90% of the cohort completed at least one of the
four exams: 88% at pre-test (n = 130), 93% at post-test 1 (n =
136), 72% at post-test 2 (n = 89), and 70% at post-test 3 (53 of
76 students recruited to complete the nutrition survey). Fifty-three
percent of the students (n = 78) completed the first three exams.
Twenty of the 24 students (83%) who filled out the first four surveys
completed post-test 4.

The first set of data reported includes all students who completed 
the nutrition survey at the first four test administrations except for
the comparison group. The second set of data reported includes 
only those students who completed the survey at all five adminis- 
trations. 

There was a significant increase in knowledge over the four test 
administrations (see Table 1). Results of the repeated-measures 
ANOVA showed a significant increase in knowledge over the 
three-year time period using the 45-item subtest. The number of 
correct answers increased; the numbers of incorrect and “don’t 
know” responses decreased. Within-subject comparisons between 
each time period and the previous time period were also significant 
(see Table 1), indicating a significant increase in knowledge from
one time point to the next. In addition, knowledge relative to the
content covered in the first-year curriculum (30-item subtest)
increased over time (see Figure 1).

Students who completed the first four nutrition surveys (n = 24)
were compared with randomly selected groups of students who did
not complete all four surveys on pre-test, post-test 1, and post-test
Figure 1. Mean scores ± 2 standard errors on the Nutrition Knowledge Progress
Survey (30 items from the first-year curriculum) for those students who
completed all five test administrations (n = 20). Repeated-measures ANOVA: F =
23.1 (4, 16), p < .001. The test was scored +1 for a correct response, 0 for “don’t
know,” and -1 for an incorrect answer.



2 mean scores. There was no significant difference between the two 
groups on any of these measures, indicating that there was no bias 
in terms of nutrition knowledge as to who completed all of the 
four nutrition surveys.

Discussion 

Results from this study indicate that the goals of the curriculum
were met; medical students who received the longitudinal
integrated nutrition curriculum did increase their knowledge over time
and retained the knowledge learned in the first year through the
third year (see Figure 1). In addition, students appeared to be more
confident in their responses, since they decreased their use of the
“don’t know” response, even though they risked losing points for
an incorrect answer. These results, however, do not mean that
students are more able to apply their knowledge in the clinical setting.
In contrast, anecdotally, we know from speaking informally with
fourth-year medical students that they felt very uncomfortable
being alone in an exam room with a patient who asked about diet or
supplements. Results from the AAMC Graduation Questionnaire
confirmed this. Even though our students clearly increased their
knowledge over time, 68% of this cohort still reported inadequate
education in nutrition in their curriculum, compared with 64%
nationally. This finding might reflect the students’ greater
understanding of the importance of nutrition in clinical practice based
on the curriculum. On the other hand, it may be that their own
clinical experience, although limited, had informed them of their
need to know.
There are a number of limitations to this study. First, only one
school was studied, so results might not be comparable in another
school. There may have been a test effect from using the same
survey over time, although given the time lag between
administrations and the lack of grading associated with it, this seems unlikely.
There may have been sample bias if those students who completed
the survey all five times were more interested in nutrition. Again,
this is unlikely given the comparison of those students who took
all five tests with those who took only the pre-test, post-test 1, or
post-test 2. Finally, it is possible that the results are purely from a
maturation effect. This is not likely, however, given our earlier
study results indicating no significant difference in a comparison of
scores on post-test 2 of a control group (those not completing the
nutrition curriculum) with scores of students who had completed
the nutrition curriculum.

Results from this study are promising, but there is still a way to
go; one of the biggest hurdles remaining is incorporating nutrition
into the clinical curriculum. The average number of items answered
correctly by those graduating students who completed the survey
was 32 of 45, indicating a marginally passing score of 71%. This
denotes an increase of only 13% from their scores at the end of
the first year. However, these results are similar to those of a
multischool study conducted in the late 1980s, in which fourth-year
students at 11 southeastern U.S. medical schools scored an average
of 69% on a similar survey. Scores were related to the amount of
required nutrition curriculum the students had experienced.
Although knowledge scores increased, students’ attitudes with respect
to the importance of nutrition for their careers deteriorated from
year one to the end of the clinical curriculum. At our institution,
nutritional content increased in the third-year curriculum, but little
advancement was made into any clerkship except family medicine.
Given the general lack of nutrition knowledge of clinicians, it
is likely that there were few preceptor role models who
demonstrated or reinforced nutritional assessment or dietary counseling of
patients. Consistent with this are the results of a study comparing
the nutrition knowledge of our fourth-year students with that of
physicians attending a local nutrition continuing medical education
course. The students significantly outscored the physicians in
nutrition knowledge (68% versus 52%).

Last, although a case with nutritional content was inserted into
our senior clinical performance examination, this occurred after
this cohort of students had graduated. Therefore, the only change
observed (an increase of students’ knowledge of nutrition
concepts) provides no evidence that students will apply this
information in clinical practice, our ultimate goal. Further studies are
needed to examine this potential effect of the curriculum.

This work was supported by a National Cancer Institute R25 grant (NCI PAR 94-005).

Correspondence: Carol S. Hodgson, PhD, UCLA School of Medicine, 10833 Le
Conte, 60-051 CHS, Los Angeles, CA 90095-1722.
• KEEP ON TRACKING



Moderator: Barbara Barzansky, PhD



Following Medical School Graduates into Practice: Residency Directors’ Assessments after the
First Year of Residency

GWEN L. ALEXANDER, WAYNE K. DAVIS, ALICE C. YAN, and JOSEPH C. FANTONE III



Extensive resources are devoted to preparing medical students to
practice in the demanding world of medicine. While students’
progress is extensively monitored during medical school, very few
medical schools have reported research showing the relationship of
medical school preparation to performance during residency
education. There is growing recognition of the need for measurable
outcomes of medical education. Performances of graduates in their
residency programs provide one outcome that could be used to
assess the quality of medical school educational programs. The
purpose of this study was to consider information about the
performances of our graduates, assessed early in their residency education
by residency program directors, and to explore the relationship
between those ratings and our graduates’ performance evaluations
during medical school.

In the spring of 1997, the University of Michigan Medical
School (UMMS), in Ann Arbor, Michigan, began a longitudinal
follow-up program designed to collect residency directors’
assessments of the performances of our graduates at the end of their first
year of residency. This investigation was, in part, inspired by the
Liaison Committee on Medical Education (LCME) statement that
medical schools must evaluate the effectiveness of educational
programs and document graduates’ achievement, showing the extent
to which institutional and program purposes are met. Initiation of
this study coincided with the completion of an extensive and
incremental curricular change. The goals of the curricular change,
reflecting changes in educational goals, included more
opportunities for clinical applications of medical science and hands-on,
active learning in the first two years. Extensive efforts were made to
encourage collegiality and professionalism among students, and
more frequent and earlier patient encounters were introduced to
promote a more humanistic, patient-centered approach to medical
decision making. The evaluation system was also revised to pass/fail
grading in the first year, with additional mechanisms implemented
to ensure earlier and increased feedback to students from objective
measures throughout the first two years of medical school.

An important goal of this research project was to validate the
system used to assess students’ performances in medical school by
comparing the medical school’s assessments with performance
assessments of UMMS graduates early in their residency education.
In particular, we wanted to assess the contributions of academic
assessments at various intervals during medical school to ratings of
residency performance across all students and by subgroups based
on academic achievement, gender, and ethnicity.

Method 

To collect residency directors’ ratings of our graduates’ skills and
abilities, we developed an instrument representing various domains
of medical practice and aligned with the key goals of our revised
curriculum. The seven domains included in the instrument were
clinical judgment, patient management, clinical skills, professional
qualities, humanistic qualities, oral and written presentation skills,
and a final overall performance assessment question. The survey
instrument used a five-point Likert-type response format (1 = poor,
2 = fair, 3 = good, 4 = very good, and 5 = excellent). Residency
directors were also asked to make written narrative comments on
the instrument. Our intention was to construct an instrument that
would be self-explanatory and that could be completed in five
minutes or less. Curriculum committee members approved the
finalized survey.

The survey was mailed to the residency directors of the UMMS 
graduating classes of 1996, 1997, and 1998 in May of the graduates’ 
first year of residency. Responses were categorized by each graduate’s 
residency specialty type and by his or her program’s affiliation with 
either a community-based or a university-based hospital.

Medical school assessments considered in the analyses included 
overall grade-point average (GPA) of the second medical science 
year (M2), U.S. Medical Licensing Examination (USMLE) Step 1 
scores, overall grade-point average of the seven required clerkships 
in the third (clinical) year (M3), USMLE Step 2 scores, and a 
cumulative composite score at graduation. This composite score at 
graduation was composed of a grade-point average computed over 
all second-, third-, and fourth-year courses, with a small fraction
representing USMLE Step 1 and Step 2 scores, using the formula
of medical school cumulative grade-point average (GPA) +
[(USMLE Step 1 + USMLE Step 2)/4,000].
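
The composite formula above can be written out directly. The sketch below is a minimal illustration of that arithmetic; the function name and the example GPA and Step scores are hypothetical and are not data from the study.

```python
def graduation_composite(cumulative_gpa, step1_score, step2_score):
    """Composite score at graduation as described in the text:
    cumulative GPA plus (USMLE Step 1 + USMLE Step 2) / 4,000.
    """
    return cumulative_gpa + (step1_score + step2_score) / 4000


# Hypothetical graduate: GPA of 3.5, Step 1 of 220, Step 2 of 230.
example_composite = graduation_composite(3.5, 220, 230)
print(round(example_composite, 4))
```

Because the two USMLE scores are divided by 4,000, they shift the composite by only about a tenth of a point for scores in this range, which is the “small fraction” the text refers to.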

The structure of the instrument was assessed using
principal-components factor analysis. Cronbach’s alpha was used to determine
the instrument’s internal consistency. Responses were initially
analyzed by graduating class. Demographic, program, and academic
achievement variables were compared to determine
representativeness of responses. Descriptive statistics for the individual items on
the survey were compared based on the residency program’s
affiliation, specialty subgroup, and the gender of the graduate. A lack
of differences among individual graduation years allowed the
combination of data from all three years. Correlations were computed
between measures of medical school performance and directors’
ratings. A one-way analysis of variance (ANOVA) was used to
compare subgroup means, utilizing post-hoc tests for mean differences.
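
For readers unfamiliar with the internal-consistency statistic named above, the following sketch computes Cronbach’s alpha from a small matrix of ratings using the standard formula, alpha = k/(k - 1) × (1 - sum of item variances / variance of total scores). The three-item ratings are hypothetical and serve only to show the calculation; they are not the survey data.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a list of rows (one row of item ratings per person).

    Uses sample variances for the items and for the row totals.
    """
    k = len(ratings[0])                                   # number of items
    item_variances = [variance(column) for column in zip(*ratings)]
    total_variance = variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings on a 1-5 scale: four residents, three items.
hypothetical_ratings = [
    [4, 4, 5],
    [3, 3, 3],
    [5, 5, 5],
    [2, 3, 2],
]
alpha = cronbach_alpha(hypothetical_ratings)
print(round(alpha, 3))
```

Values near 1 indicate that the items rise and fall together across respondents, which is the pattern one would expect if all items reflect a single overall impression.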

Results 

A single mailing of the survey instrument was sent to residency
directors of 498 graduates of three consecutive graduating classes,
and 338 (68%) were returned. The graduates represented by
directors’ responses were 61% men and 39% women. The residents’
racial-ethnic subgroups were Asian (16%), underrepresented
minority (15%), and white and all others (69%). The residents’ specialty
subgroups were primary care (50%), surgery and surgery
subspecialties (27%), and all other specialties (23%). The 136 graduates
not represented by directors’ responses were statistically similar in
distribution by gender, ethnicity group, average overall M2 GPA,
average overall M3 clerkship performance GPA, and average
USMLE Step 1 scores. The return rate from directors of surgery
subspecialties was lower than those of other residency specialty
groups (chi-square = 10.2, p < .002).

Across all responses, the average ratings for individual survey
items were above 4.0 (very good), with the highest average ratings
given for the items assessing humanistic and professional qualities.
Although several content areas were included in the instrument,
factor analysis of the domains represented by the instrument’s seven
items demonstrated a single factor, explaining 74% of the variance
in scores. Internal consistency of the items in the single factor was
high (Cronbach alpha of .94). These findings suggested that the

Academic Medicine, Vol. 75, No. 10 / October Supplement 2000



Table 1. Pearson Correlations* between University of Michigan Medical
School Performance Evaluations and Overall Performances Assessed by
Residency Directors for the Graduating Classes of 1996, 1997, and 1998

                                     M2      USMLE     M3      USMLE     Overall Graduation
                                     GPA     Step 1    GPA     Step 2    Composite Score
USMLE Step 1                         .82
M3 GPA                               .63     .58
USMLE Step 2                         .69     .81       .64
Overall graduation composite score†  .84     .72       .85     .70
Overall performance‡ (residency)     .20     .20       .41     .24       .32

* All Pearson correlations significant (p < .000), n = 338.
† Overall graduation composite score is composed of a grade-point average computed
over all second-, third-, and fourth-year courses with a small fraction representing
USMLE Step 1 and Step 2 scores, using the formula medical school cumulative GPA +
[(USMLE Step 1 + USMLE Step 2)/4,000].
‡ Overall performance is a single item representing the seven domains included in the
survey completed by residency program directors.


survey was measuring the directors’ singular perceptions of the
residents’ performances. A decision was made to use the instrument’s
final item, “overall performance,” to represent directors’ assessments
in all further analyses.

Inter-item correlations were high (p < .000) between the individual
grading indices during medical school (M2 GPA, M3 GPA, USMLE Step 1
scores, USMLE Step 2 scores, and overall cumulative composite score).
(See Table 1.) The correlation between M3 clinical grades and the
overall performance item assessed by program directors was stronger
(r = .41) than that for the composite cumulative grade at graduation
(r = .32). The correlation between M3 grade average and overall
residency performance rating was nearly twice the magnitude of the
correlations between M2 overall GPA, USMLE Step 1, or Step 2 scores
and the overall residency performance. (See Table 1.) When we looked
at the inter-item correlation between the various medical school
assessment components and the seven individual domains of our
instrument, we found the relationships to be positive and
statistically significant for all individual domains except one:
humanistic qualities, assessed by residency directors, was not
related to overall M2 grades (r = .07, p = .12).

Another analysis examining the relationship of undergraduate 
medical school grades to assessments of residency performance com- 
pared subgroups composed of thirds of the class, based on an overall 
composite score at graduation. Performance of graduates who had 
been in the top third of their class, on average, was rated higher 
than was performance of those who were in the lowest third of the 
graduating class (see Table 2). This relationship held when com- 
paring top and lower thirds based on all medical school assessment 
components considered in our study (M2 overall GPA, USMLE 
Step 1, M3 overall GPA, and USMLE Step 2 scores). The greatest
differences between groups were found when comparing thirds of 
the class based on M3 overall GPA. Statistically significant differ- 
ences were found when comparing directors’ mean ratings of overall 
performances between those in the lowest and middle thirds, and 
again when comparing the middle third’s with the top third’s
average ratings (p < .05).
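
The thirds-of-class comparison can be illustrated with a minimal Python sketch. The records, the field names (`m3_gpa`, `director_rating`), and the rank-based split are assumptions made for illustration, not the study’s data or its exact grouping rule. The sketch reports only group means and omits the ANOVA and post-hoc tests.

```python
from statistics import mean

def split_into_thirds(records, key):
    """Rank records by `key` and split them into lowest, middle, and top thirds.

    Assumes at least three records so that no third is empty.
    """
    ordered = sorted(records, key=lambda r: r[key])
    n = len(ordered)
    first_cut, second_cut = n // 3, (2 * n) // 3
    return ordered[:first_cut], ordered[first_cut:second_cut], ordered[second_cut:]

def mean_rating_by_third(records, key, rating="director_rating"):
    """Mean director rating (1 = poor, 5 = excellent) within each third."""
    return [mean(r[rating] for r in third) for third in split_into_thirds(records, key)]

# Hypothetical graduates; all values are invented for illustration only.
graduates = [
    {"m3_gpa": 2.9, "director_rating": 3.0},
    {"m3_gpa": 3.0, "director_rating": 4.0},
    {"m3_gpa": 3.2, "director_rating": 4.0},
    {"m3_gpa": 3.4, "director_rating": 4.0},
    {"m3_gpa": 3.7, "director_rating": 5.0},
    {"m3_gpa": 3.9, "director_rating": 5.0},
]

lowest, middle, top = mean_rating_by_third(graduates, "m3_gpa")
print(lowest, middle, top)
```

Passing a different `key` would regroup the same graduates by another measure, which is how each row of a table such as Table 2 could be produced.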

Comparisons of our graduates’ ratings by gender, by residency
specialty (grouped by primary care, surgery and surgery
subspecialties, and all other subspecialties), and by residency
program affiliation (either community-based or university-based
residency programs) showed no difference, on average, for overall
residency performance. When the race-ethnicity of graduates was
considered, using the three subgroups of underrepresented minority
students,
Table 2. Comparisons of Residency Directors’ Mean Assessments of
Overall Performance by University of Michigan Medical School Grading
Components, Classes of 1996, 1997, 1998*

                              Performance Level and Directors’ Ratings

Undergraduate                 Lowest Third    Middle Third    Top Third
Performance Measures          Mean (SD)       Mean (SD)       Mean (SD)

M2 grade-point average        4.08 (0.81)     4.14 (0.79)     4.42† (0.76)
                              n = 107         n = 97          n = 101

USMLE Step 1 score            3.97 (0.79)     4.34 (0.68)     4.32‡ (0.84)
                              n = 107         n = 104         n = 117

M3 grade-point average        3.80 (0.92)     4.12 (0.73)     4.61§ (0.58)
                              n = 98          n = 125         n = 111

USMLE Step 2 score            4.04 (0.81)     4.17 (0.75)     4.39† (0.75)
                              n = 114         n = 110         n = m

Overall graduation            3.94 (0.88)     4.15 (0.74)     4.50† (0.73)
composite score               n = 107         n = 118         n = 112

* Using this table, for example, students whose medical school GPAs were in the
lowest third received mean ratings of 4.08 from their residency directors, those receiving
GPAs in the middle third received mean ratings of 4.14, and those with the highest GPA
received mean ratings of 4.42. Mean ratings from residency directors were based on their
responses to a summary item “overall performance” on an 8-item questionnaire using
a 5-point Likert-type rating scale (1 = poor, 5 = excellent).
† Top third differs from lower third and middle third, p < .05.
‡ Lower third differs from middle and top third, p < .05.
§ All groups differ, p < .05.



Asian students, and all other students, no difference was found in 
the comparison of program directors’ ratings of overall residency 
performance with cumulative composite scores at graduation. Re- 
gardless of their racial-ethnic subgroups, the students in the top 
third of the class, based on the cumulative composite score, were 
rated higher by program directors than were the students in the
lowest third. 

Discussion 

Concerns expressed about the participation rates of residency
directors at the onset of this project were dispelled. Our relatively
high response rate without follow-up is consistent with other
researchers’ efforts, and it provides evidence that residency
directors are willing to provide assessments of graduates’
performances and feedback to medical schools regarding graduates.

Finding that the survey measured essentially one dimension of
our graduates’ early residency performance was consistent with
findings of other studies. Unlike our medical school’s composite
index, which was computed from many individual measures, the program
directors were providing ratings on single items. It is possible that
the residency directors based their ratings on a single overarching
impression of our graduates that spilled over into ratings of
performance in all domains, rather than making distinctions among the
strengths and weaknesses of individuals. Just as our medical school
combined performance measures across multiple courses and learning
experiences in an “overall” percentage of GPA index for our
students, the residency directors in our study tended to make global
assessments of the residents’ performances rather than distinctions
among the items in the survey.

We were encouraged that our graduates were rated, on average,
as “very good” or higher by residency directors. The consistency of 
our graduates’ ratings, across specialty areas and regardless of uni- 
versity- or community-based program affiliation, provided confir- 
mation that our graduates are prepared and adaptable to medical 
practice in a variety of settings. 

As expected, positive and relatively high correlations were found
among the grading components during medical school (M2 GPA,
M3 composite scores, cumulative composite score, and USMLE
Step 1 and Step 2 scores). While low in magnitude, the correlations
between residents’ medical school grades and their residency
directors’ assessments were statistically significant. Although these
findings support the relationship between medical school achievement
and later performance, academic performance in this study explained
less than 20% of the variance in overall residency performance.
Academic assessments of this type in medical school do not appear to
be capturing other important factors contributing to directors’
assessments after graduation. The strength of the correlation between
the M3 GPA and the residency directors’ assessments may have been due
to a “method effect” of the ratings provided. Just as the residency
directors made largely subjective assessments of our graduates, the
majority of the overall clerkship grades for the required clerkships
are provided by attendings’ ratings of students’ clinical
performances. It is possible that the number of students in a
residency program and the degree of familiarity between the residency
director and the graduate may have contributed to the rating patterns.
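
The statement that academic performance explained less than 20% of the variance can be checked by squaring the correlations in the bottom row of Table 1, since the squared Pearson correlation is the proportion of variance shared by two measures. The sketch below is only that arithmetic; the dictionary keys are shorthand labels, and the study may have arrived at its figure differently.

```python
# Correlations with the directors' "overall performance" rating (Table 1, bottom row).
correlations = {
    "M2 GPA": 0.20,
    "USMLE Step 1": 0.20,
    "M3 GPA": 0.41,
    "USMLE Step 2": 0.24,
    "Graduation composite": 0.32,
}

# Squared correlation = share of variance in the rating associated with each measure.
variance_explained = {name: r ** 2 for name, r in correlations.items()}

for name, r2 in variance_explained.items():
    print(f"{name}: r = {correlations[name]:.2f}, r^2 = {r2:.1%}")
```

Even the strongest single predictor, the M3 GPA, corresponds to roughly 17% of the variance in the directors’ ratings.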

Combining data across three years achieved an “n” large enough 
to compare a variety of subgroups. We found that the students 
represented in the top thirds of their classes, for all academic mea- 
sures in this study except the USMLE Step 1 scores, were rated 
higher by residency directors, on average, when compared with the 
students in the middle and lowest thirds of their classes. Average 
ratings based on thirds of the class by clerkship performance in the 
M3 year proved to be the most consistent with the residency di- 
rectors’ average ratings of our graduates’ performances. Further, sub- 
group analyses showed that medical school performance accounted 
for the difference in program directors’ ratings, regardless of a
graduate’s gender or race-ethnicity.

While it may be intuitive that quality of performance after med- 
ical school relies on quality of achievement before graduation, our 
findings provide evidence to support this. We were able to
demonstrate a correspondence between students’ performances in
components of our school’s evaluation system and residency directors’
ratings of their subsequent performances. The relationship we found
between academic achievement during medical school and performance
in residency lends validity to the evaluation system utilized
by our medical school, and supports the use of these postgraduate 
outcomes as measurements of educational programs. Identifying 
standardized, objective measures that could be utilized as an index 
of residency performance, similar to those used by our medical 
school evaluation system, might enhance the value of residency 
performance ratings as an educational outcome. 

The findings of this study are additionally important in increasing
our understanding of factors that do not appear to contribute
to performance ratings in graduate education. Based on our data,
specialty type, gender, and race-ethnicity of graduates, when
academic achievement was taken into account, were not contributing
factors in residency performance ratings. Discovering and measuring
contributing factors other than those included in our evaluation
system is our challenge in medical education.

Correspondence: Gwen L. Alexander, PhD, University of Michigan Medical School,
Department of Medical Education, G1211 Towsley Center, Ann Arbor, MI
0201. 
MAKING THE CUT



Moderator: Susan Case, PhD 



The Impact of an Alternative Approach to Computing Station Cut Scores in an OSCE 

JODI HEROLD McILROY



The OSCE is gaining widespread recognition as a valid means of
assessing entry-to-practice competence, or eligibility for licensure,
of physicians, physiotherapists, and other health professionals.
Given the high-stakes nature of these licensure OSCEs, robust
psychometric properties of the exams are essential. One of these
properties is the resistance of cut scores used in determining
pass-fail decisions to such sources of error as differences in
examiner perceptions of competence and examiner stringency in judging
competence.

A number of standard-setting methods have been described in
the literature on performance-based assessment. Methods are
typically categorized as relative or absolute, with most
administrators responsible for high-stakes examinations preferring
absolute or criterion-referenced methods. Absolute standard-setting
methods compare candidates’ performances with an externally
determined or defined measure (criterion) and are typically
categorized as test-centered or examinee-centered. These categories
distinguish methods according to whether judgments about competence
are based primarily on inspection of test items (e.g., Angoff, Ebel,
and similar methods) or on judgments about examinees (e.g.,
contrasting groups, borderline group). The common elements for all
methods include (1) use of expert judges and (2) reference to a
hypothetical “minimally competent” person or a hypothetical
“borderline competent” performance. Descriptions and classifications
of standard-setting methods can be found in review articles by
Cizek, Berk, and Cusimano.

A modification of the mean-borderline-group method that is
now being employed by a number of credentialling agencies entails
identification of a subgroup of candidates actually performing the
exam who are identified by the examiners as having a level of
clinical competence that is just on the borderline between being
competent and not being competent. The station scores for this
borderline group are averaged to generate the station cut score. In
this approach, the rating of candidates’ performances as competent,
borderline, or not competent is concurrent with completion of
checklists and/or other scoring rubrics by these same examiners.

This modified mean-borderline-group method has potentially
interesting implications for the determination of cut scores when
large-scale, multi-site examinations are employed. The cut score is
calculated as the mean of the scores of all candidates who receive
borderline ratings, regardless of site of administration (or
examiner). Thus, in a multi-site examination where examiners are
nested within sites, an examiner who identifies a greater number of
“borderline competent” candidates during the exam has a greater
influence than other examiners on the resultant cut score for that
station. The inequality of examiners’ influence over station cut
scores is inconsistent with other standard-setting methods described,
such as the Angoff and Ebel methods, and could be problematic when
combined with the potential for examiners’ differing perceptions of
what constitutes a borderline performance.

The proposed alternative to this current approach is one in
which every examiner’s opinion or concept of borderline competence
is weighted the same. In other words, the mean of each examiner’s
borderline group is calculated first, then the mean across
examiners evaluating the same station is calculated to determine
the cut score. The impacts of individual examiners on the resultant
cut score are thus equalized. The effect of the alternative method
on cut scores and the practical impact on pass-fail decisions were
explored in order to determine whether further investigation of cut
score validity is required.

Method 

Data for 1,373 candidates who participated in four administrations
(years) of an OSCE used in a national physiotherapy examination
were used in the study. Each administration of the OSCE consisted
of 20 stations in which candidates were required to perform a
clinical skill in the context of a clinical scenario. Results from
two stations had been removed for administrative reasons, so these
results were not included in the data set provided. Therefore, the
scores for a total of 78 stations were used in the study.

Candidates rotated through circuits of ten ten-minute stations 
and ten five-minute stations. Each site consisted of two ten-minute 
circuits per five-minute circuit, or two examiners of each ten-min- 
ute station, and one examiner of each five-minute station. Each 
year had a median of 40 candidates per site. (Therefore, as a rule, 
ten-minute station examiners evaluated 20 candidates and
five-minute station examiners evaluated 40 candidates.)

There were seven, five, 12, and 14 sites of administration in the
four respective years of the exam. Each candidate was allowed to
choose the sites at which he or she participated, so assignment of
candidates to sites was not a random process and will likely have
been influenced by location of training. The examiners were
clinicians from the local community. They were assigned to stations
according to their self-identified areas of clinical expertise. They
attended a training session where they were oriented to the
examination procedures and scoring processes before the exam.

Scoring of the OSCE. The candidates’ performances were rated
using a task-specific dichotomous checklist where clinician
examiners record whether key behaviors are demonstrated correctly.* In
addition, overall performance was rated on a six-point rating scale.
The two middle anchors (3 of 6 and 4 of 6) on this scale were
“borderline unsatisfactory” and “borderline satisfactory.” The
borderline group used in computation of cut scores is considered to
include all candidates assigned either of these two scores. The
overall rating of performance was considered for the sole purpose of
identifying borderline candidates to calculate cut scores.

Computation of Cut Scores. The traditional approach entailed
finding the mean checklist score for all candidates identified as
borderline, regardless of site or examiner. The alternative approach
entailed finding the examiner-specific mean checklist score for the
borderline group, then computing the average of these means across
examiners. This second method, in effect, weights all examiners’
opinions equally. These two checklist-based cut scores were
computed for the 78 stations used over the four examinations.
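
The two computations differ only in the order of averaging, which a short Python sketch can make concrete. The function names, the `(examiner, score)` input format, and the sample scores are hypothetical and are not drawn from the examination data.

```python
from collections import defaultdict
from statistics import mean

def traditional_cut_score(borderline):
    """Pooled mean checklist score of all borderline candidates at a station."""
    return mean(score for _, score in borderline)

def alternative_cut_score(borderline):
    """Mean of examiner-specific borderline means (each examiner weighted equally)."""
    by_examiner = defaultdict(list)
    for examiner, score in borderline:
        by_examiner[examiner].append(score)
    return mean(mean(scores) for scores in by_examiner.values())

# Hypothetical station: (examiner id, checklist % score) for candidates rated
# "borderline unsatisfactory" or "borderline satisfactory".
borderline = [("A", 60.0), ("A", 62.0), ("A", 64.0), ("A", 66.0), ("B", 50.0)]

print(traditional_cut_score(borderline))  # examiner A supplies 4 of the 5 scores
print(alternative_cut_score(borderline))  # examiners A and B count equally
```

In this invented case examiner A rates four candidates as borderline and examiner B only one, so the pooled mean sits closer to examiner A’s standard than the equally weighted mean does.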

In addition to the overall examination score, candidates are
required to perform satisfactorily in a criterion number of
stations to pass. The number of stations required to pass an
examination fluctuates from year to year, depending on the level of
difficulty of the examination. In the hypothetical situation
constructed for the purpose of these analyses, I used 12- and
13-station criteria to examine two different scenarios for the
impact of using the alternative method of computation on exam-level
pass-fail decisions.

* The cut score is, in practice, applied to a station score that is a composite of a

Results 

When I examined descriptive data for examiner patterns with re- 
gard to use of the “borderline competent” rating, the findings were 
very consistent for the five-minute and ten-minute stations. For all 
analyses presented, the results are pooled across the 78 stations to
maximize power. 

There were observed differences in the proportions of candidates 
deemed borderline by different examiners examining the same sta- 
tion. The mean discrepancy (the range in the proportions of can- 
didates identified as borderline by different examiners of a given 
station) across the 78 stations was 48%, with the lowest discrepancy
being 11% (where an examiner at one site identified no candidate
as borderline, and an examiner at another site identified 11% as
borderline) and the highest discrepancy being 90% (with one ex- 
aminer identifying no borderline candidate and another identifying 
90% of candidates as borderline). Clearly, in the computation of 
cut scores, some examiners are contributing substantially more bor- 
derline candidates than others. 
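
The discrepancy statistic defined above is a within-station range, sketched below in Python. The station names and the percentages are hypothetical, except that the first two stations mirror the smallest (11%) and largest (90%) discrepancies reported.

```python
from statistics import mean

def borderline_discrepancy(percent_borderline):
    """Range (max - min) of the % of candidates each examiner rated borderline."""
    return max(percent_borderline) - min(percent_borderline)

# Hypothetical stations: one percentage per examiner of that station.
stations = {
    "station_1": [0, 11],       # mirrors the smallest reported discrepancy
    "station_2": [0, 45, 90],   # mirrors the largest reported discrepancy
    "station_3": [20, 35, 50],
}

discrepancies = {name: borderline_discrepancy(p) for name, p in stations.items()}
mean_discrepancy = mean(list(discrepancies.values()))
print(discrepancies, mean_discrepancy)
```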

The examiner-specific cut scores (the mean checklist score for 
borderline candidates at a site) also displayed within-station ranges. 
For example, there was a cut-score discrepancy of 56% (of total 
possible checklist points) between two examiners of a given station 
on two of the 78 stations examined. Thirty-five of the 78 stations 
(45%) had cut score ranges of 30% or more across examiners. 

Thus, when discrepancies in examiner-specific cut scores were
considered in conjunction with discrepancies in the proportions of
candidates rated by individual examiners as borderline, there was
high potential that, in this hypothetical case based on checklist
scores only, use of the proposed alternative method of computation
could lead to very different results, at least at the level of station
cut scores and pass-fail decisions. The remaining analyses assessed
the impact of the observed variability among examiners in their
applications of the borderline rating on exam results.

Impact on Raw Cut Scores. The two computation methods generated
very similar ranges of station cut scores across the 78 stations.
The traditional method resulted in cut scores ranging from
37.03% for the most difficult station to 87.13% for the easiest,
while the alternative method resulted in cut scores ranging from
35.37% to 86.93%. At the level of station, differences between the
two cut scores were strikingly small, with the maximum difference
in cut scores between the two methods being 4.73%. Differences
between cut scores were relatively normally distributed around a
mean difference of 0.31. There was no significant difference
between the cut scores using the two different approaches
(paired t(77) = 1.66, p = .10).
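
For readers unfamiliar with the test reported above, a paired t statistic is the mean of the per-station differences divided by its standard error, with degrees of freedom equal to the number of pairs minus one (78 stations give 77). The Python sketch below uses invented cut scores for four stations; the function name and data are assumptions, and the p value lookup is omitted.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(x, y):
    """Return (t statistic, degrees of freedom) for paired observations.

    Assumes at least two pairs and differences that are not all identical
    (otherwise the standard deviation is zero and t is undefined).
    """
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Hypothetical station cut scores (%) under the two computation methods.
traditional = [60.4, 72.0, 55.0, 81.5]
alternative = [56.5, 72.5, 54.0, 81.0]

t_stat, df = paired_t(traditional, alternative)
print(t_stat, df)
```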

Also worth noting is the fact that of the 78 stations, 22 were
used on more than one occasion. From these stations, it was
possible to assess the stability of cut scores over time. The absolute
difference in cut scores for the two iterations of each station was
calculated for each method. The two methods showed equal levels
of cut-score stability of the 22 stations over multiple occasions, in
that there was no significant difference in the absolute difference
score (reflective of cut score change over time) between methods
(paired t(21) = 1.01, p = .33).

Impact on Station-level Pass-Fail Decisions. There was an equally
small effect on pass-fail decisions made at the level of station. For
58 of the 78 stations (74%) there was perfect concordance of
pass-fail decisions made using the two methods (i.e., failure rates
were unaffected by use of the alternative method). For the 20 stations
that were affected, the alternative method increased failure rates at
13 stations while decreasing rates at seven stations. Changes in
failure rates for the 20 affected stations ranged from 2% to 18% of
candidates examined. There was, however, no significant difference
in station failure rates between methods (paired t = 0.78, p =
.44).

When the 22 repeat-use stations were examined for stability of 
station-level pass-fail decisions, the alternative computation 
method did not result in a substantial practical effect. The absolute 
differences between failure rates at two occasions of station use were 
no different when compared for the two methods (paired t = 0.24,
p = .82). 

Impact on Exam-level Pass-Fail Decisions. The examination-level
pass-fail decisions are made on the basis of the number of stations
passed. In other words, candidates’ station scores must be above the
station cut score on, say, 12 of 20 stations. In the hypothetical
situation created for the study, candidates were required to meet a
criterion of passing 12 or 13 stations (both scenarios were
examined), based on historical precedent at this and other similar
testing organizations. When candidate performance data were used to
examine whether the effect of using the alternative cut-score
computation method on station-level decisions would translate into an
effect at the level of the entire examination, very little impact was
seen. The exam-level agreement rates for the two methods (i.e.,
the proportion of candidates where the exam-level pass-fail
decision was unaffected) ranged from 95% to 98% for the four exams,
with kappa coefficients ranging from 0.88 to 0.96. Pooled across all
four years of administration, the agreement rates were 96% for a
13-station criterion and 97% for a 12-station criterion.
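
The exam-level rule and the two agreement indices reported above can be sketched in Python as follows. The function names and the ten hypothetical candidates are assumptions for illustration; kappa is computed with the standard Cohen formula, which corrects raw agreement for the agreement expected by chance.

```python
def exam_pass(station_scores, cut_scores, stations_required):
    """Pass if the score is above the cut score on enough stations."""
    passed = sum(s > c for s, c in zip(station_scores, cut_scores))
    return passed >= stations_required

def agreement_and_kappa(decisions_a, decisions_b):
    """Raw agreement and Cohen's kappa for two lists of pass (True) / fail (False).

    Kappa is undefined when chance agreement equals 1 (e.g., everyone passes
    under both methods); that case is not handled here.
    """
    n = len(decisions_a)
    observed = sum(a == b for a, b in zip(decisions_a, decisions_b)) / n
    pass_a = sum(decisions_a) / n
    pass_b = sum(decisions_b) / n
    expected = pass_a * pass_b + (1 - pass_a) * (1 - pass_b)
    return observed, (observed - expected) / (1 - expected)

# Hypothetical exam-level decisions for ten candidates under each method.
traditional = [True, True, True, False, False, True, True, False, True, True]
alternative = [True, True, True, False, True, True, True, False, True, True]

agreement, kappa = agreement_and_kappa(traditional, alternative)
print(agreement, kappa)
```

In practice `exam_pass` would be applied once per candidate with each method’s station cut scores and a criterion of 12 or 13 stations, and the two resulting decision lists passed to `agreement_and_kappa`.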

Discussion 

There are a number of standard-setting methods described in the
OSCE and performance-based-assessment literature. The currently
used modification of the mean-borderline-group method of computing
station cut scores in multi-site examinations is the only
method that gives unequal weighting to judges (examiners) as a
result of the practicalities of implementation. The observed ranges
of examiner-specific cut scores, combined with the differences in
proportions of borderline candidates identified by different
examiners of the same station, open up the potential for individual
examiner(s) to influence the cut score in a manner inconsistent with
the opinions of the other examiners. This is a sharp contrast with
most other standard-setting models, where all experts’ judgments
are weighted equally. This study proposed an alternative method
that attempts to correct for this imbalance and examined the
practical implications of using this alternative.

Given the observed variability of examiners in the application
of the borderline rating, there was surprisingly little impact when
empirical data were subjected to the alternative computation
method. There was remarkable consistency between the cut scores
generated by the two methods within stations, with the largest
observed difference being only 5%, or the equivalent of no more than
two checklist items. The small differences translated to similarly
consistent pass-fail decisions at the level of individual stations. Of
the 78 stations on which the two methods were compared, decisions
were unaffected at 58 (74%). Furthermore, neither raw cut
scores nor station-level failure rates were systematically affected by
use of the alternative method. That is, equal weighting of examiner
opinions did not consistently result in more or less stringent cut
scores.

The already small effect was further attenuated at the level of 
pass-fail decisions for the four 19- or 20-station examinations. 
Concordance rates between the two methods were very high, with 
final decisions being unaffected for 97% of all candidates included 
in the analyses. It appears that because there is no systematic effect
of the alternative method at the station level, increases in failure
rates at one station are being counteracted by decreases at another
station in the same examination.

Further, equal weighting of examiner opinions does not influence
the reproducibility of station cut scores (and the resultant failure
rates) across testing occasions. The degree to which failure rates
changed from one use of a given station to the next was not
systematically altered by introduction of the new computation
method.

It should be noted that this study examined only the effect of 
the alternative computation method on checklist-based ratings of 
candidate performance. Station composite scores, which the literature
suggests are more reliable, often form the basis for both
station-level and exam-level decisions. The station composite scores
were not examined in this study due to the complicating effects of 
measuring multiple constructs with multiple scoring rubrics. The 
extent to which the findings on checklist scores would be replicated 
on station composites is still untested but may be of interest to test 
developers and administrative bodies that use composite scores to 
assess performances on multidimensional examinations. 

The extent to which the observed differences in borderline ratings
are related to differences in the candidates’ abilities across sites
or differences among examiners in their use of the “borderline com- 
petent” rating may be of interest to some but is essentially an ac- 
ademic argument. Reasons for observed variations in the frequen- 
cies of borderline ratings used by examiners have not been studied 
to date, and could not be determined through the study design used 
in this project. Variability in applying the borderline rating was in 
fact observed in the data set used for the study. For these data,
despite differences among examiners, the practical implications of
weighting their opinions according to liberality of use of the “bor- 
derline” rating do not suggest a need to change current practices 
in large-scale, multi-site OSCEs. Given the equivocal psychometric 
benefits of one approach versus the other, a decision about which 
computation method should be employed when using the
mean-borderline-group technique should be based on philosophical and
practical rationales. 
MAKING THE CUT



Moderator: Susan Case, PhD



An Investigation of the Impacts of Different Generalizability Study Designs on Estimates of Variance Components and Generalizability Coefficients

L. A. KELLER, K. M. MAZOR, H. SWAMINATHAN, and M. R PUGNAIRE



In recent years, performance assessments have become increasingly
popular in medical education. While the term “performance
assessment” can be applied to many different types of assessments,
in medical education this term usually refers to some sort of
simulated patient encounter, such as an objective structured clinical
examination (OSCE) or a computer simulation of an encounter.
These types of assessments appeal to many educators because the
tasks or items used are often seen as more realistic than items on
multiple-choice examinations. However, this increased “realism” or
apparent authenticity comes at a cost: performance examinations
are typically more time-consuming and expensive both to administer
and to score. On an OSCE, each encounter with a standardized
patient is typically scored as a single item, often resulting in
an examinee’s completing only four to eight items in a two-hour
testing period. In contrast, an examinee might complete 100 to
150 items during a two-hour multiple-choice examination.

The fact that performance examinations are typically relatively
short means that test users must pay particular attention to the
reliability and validity of test scores. In general, other things being
equal, a shorter test will result in scores that are less reliable than 
a longer test. Lower reliability reflects greater error. Adding more 
items is one way that test developers may increase reliability. On 
a multiple-choice test, it is relatively inexpensive to write and ad- 
minister additional items. However, on a performance test both the 
development and administration of even a single new item can be 
expensive, and often must be justified in terms of expected gains 
in score precision. 
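
One common way to quantify the expected gain from added items is the classical Spearman-Brown formula, which the text does not invoke but which expresses the same length-reliability relationship. The Python sketch below assumes the added items are parallel to the existing ones, and the starting reliability of 0.50 is hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by `length_factor`.

    Assumes the added (or removed) items are parallel to the existing ones.
    """
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)

# Hypothetical: a six-item performance test with reliability 0.50.
for factor in (0.5, 1, 2, 4):
    print(f"{int(6 * factor)} items: {spearman_brown(0.50, factor):.2f}")
```

The projection shows diminishing returns: each doubling of test length adds less reliability than the previous one, which is why the cost of each new performance item has to be weighed against the expected gain.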

A second consideration in performance examinations is that
scoring is typically more difficult and expensive than scoring of
multiple-choice examinations. Expert or trained raters are generally
required to review each performance or a sample of performances.
Such ratings may be used to score specific performances or to
develop scoring criteria or weighting schemes. In either case, raters
are a potential source of error.

Generalizability theory provides a framework for estimating the
relative magnitudes of various sources of error in a set of scores. In 
most performance assessments, both items and raters are potential 
sources of error. Generalizability theory allows estimation of the 
error associated with each of these sources separately, as well as the 
relevant interaction effects. In a generalizability study (G study), 
the variance in a set of scores is partitioned in a manner similar to 
that used in the analysis of variance. However, in a G study the 
emphasis is not on testing for statistical significance, but rather on 
assessing the relative magnitudes of the variance components. De- 
pending on the study design, different variance components can be 
estimated. Once the variance components are estimated, additional 
analyses can be conducted. In the framework of generalizability 
theory, the second stage of analysis is referred to as a decision study 
(D study). In a D study, the estimated variance components are 
used to estimate generalizability coefficients (comparable to reli- 
ability coefficients) under various measurement conditions. Thus, 
using the results from a single test administration, it is possible to 
estimate the impacts of changing both the number of raters and 
the number of items. This is an important benefit of conducting 
analyses based on generalizability theory. However, it must be 
stressed that the variance components and G coefficients are esti- 
mates, and as such will vary depending on the specific sample used. 
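
For a fully crossed person by item by rater design with all facets treated as random, the partition of variance and the D-study coefficient described above are conventionally written as follows. This is the standard textbook form for the relative (norm-referenced) G coefficient, offered as a sketch; the primed symbols for the D-study numbers of items and raters are notation assumed here, not taken from the article.

```latex
% Decomposition of observed-score variance in a p x i x r random-effects design
\sigma^2(X_{pir}) = \sigma^2_p + \sigma^2_i + \sigma^2_r
                  + \sigma^2_{pi} + \sigma^2_{pr} + \sigma^2_{ir} + \sigma^2_{pir,e}

% Relative G coefficient for a D study with n'_i items and n'_r raters
E\rho^2 = \frac{\sigma^2_p}
               {\sigma^2_p + \dfrac{\sigma^2_{pi}}{n'_i}
                           + \dfrac{\sigma^2_{pr}}{n'_r}
                           + \dfrac{\sigma^2_{pir,e}}{n'_i \, n'_r}}
```

Only the components that interact with persons enter the relative error term, so increasing the number of items or raters shrinks the corresponding error contributions.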



Given that the results of generalizability analyses are often used 
to make practical decisions about test implementation, it is impor- 
tant to collect the data for a G study in a way that will maximize 
the precision of the variance-components estimates. Given also 
that performance assessments are costly to administer and score, 
and that resources (time, raters, and money) are typically limited, 
the question of how available resources should be allocated for a 
G study is an important one. Is it preferable to collect data from 
100 examinees on 16 items, or 200 examinees on eight items? 
Should four raters score 50 examinee performances, or should two 
raters score 100 performances? Decision studies may help to inform 
these types of decisions after the data are collected and analyzed, 
but D studies are based on G studies. To date there is no research 
we are aware of to help in planning data collection for a G study, 
especially under constraints. 

The purpose of the present study was to examine the impacts of 
different G-study designs. All of the designs simulated here contain 
the same number of data points, but the distributions of the data 
points over examinees, items, and raters are varied. By starting with 
a relatively large data set (200 medical student examinees, com- 
pleting 16 items each, scored by four raters each for a total of 
12,800 data points), we were able to conduct repeated sampling of 
different data-collection conditions and to construct empirical con- 
fidence intervals for variance components estimates. Computed 
confidence intervals were also constructed and compared with the 
empirically constructed intervals. A series of D studies was then 
conducted to illustrate how different sampling strategies and dif- 
ferent samples within those strategies could have substantial im- 
pacts on the decisions that would be likely to be made based on 
such analyses. It should be stressed that the focus of this study was 
to illustrate the impacts of various sampling strategies, rather than 
to make decisions about this particular data set. We hope to inform 
and remind test designers and users that estimates are based on 
samples, and as such contain variability, and to illustrate the extent 
to which that variability is greatly affected by the data-collection 
procedure used. 

Method 

Data. The data set used here, hereafter referred to as the “full 
sample,” consisted of four expert ratings of 200 medical students 
on 16 performance items related to a computer simulation. Each 
examinee performance was rated by each of the four independent 
raters on a holistic nine-point rating scale. From this data set, sam- 
ples were selected to fit five data-collection designs or conditions. The 
numbers of persons or examinees (P), items (I), and raters (R) for 
each condition were as follows: condition 1, P = 25, I = 16, R = 
4; condition 2, P = 50, I = 8, R = 4; condition 3, P = 50, I = 16, 
R = 2; condition 4, P = 100, I = 4, R = 4; condition 5, P = 100, 
I = 8, R = 2. These five conditions were chosen so that all samples 
contained the same total number of observations (1,600). While 
many other combinations were possible, it was beyond the 
scope of the present study to investigate every possible design. 
These five conditions were considered representative and realistic. 
One hundred replications were conducted for each condition in 
constructing the empirical confidence intervals. For the computed 
confidence intervals for conditions 1 through 5, one sample was 
Table 1. Empirical and Computed 95% Confidence Intervals for the Percentage of Variance Accounted for by Each Source by Condition, 

University of Massachusetts Medical School, 1999-2000 

Source            Condition 1    Condition 2    Condition 3    Condition 4    Condition 5    Full Sample
No. of examinees: 25             50             50             100            100            200
No. of items:     16             8              16             4              8              16
No. of raters:    4              4              2              4              2              4

Person (P)        14.5%          14.2%          14.9%          14.8%          15.1%          14.9%
  Empirical       (6.0, 26.9)    (5.4, 22.8)    (8.9, 21.3)    (6.1, 27.2)    (7.4, 23.6)
  Satterthwaite   (8.1, 33.1)    (5.5, 22.7)    (8.7, 24.0)    (9.1, 29.7)    (24.2, 56.8)†  (11.9, 19.4)

Item (I)          11.9%          13.0%          11.8%          13.3%          12.0%          11.7%
  Empirical       (6.7, 17.7)    (.9, 25.1)     (6.8, 16.3)    (.2, 36.2)     (1.2, 25.9)
  Satterthwaite   (6.1, 34.4)    (5.5, 45.1)    (8.7, 43.8)    (16.1, 39.0)†  (9.4, 74.7)    (5.8, 26.1)

Rater (R)         0.7%           0.7%           0.9%           0.5%           0.7%           0.7%
  Empirical       (.3, 1.3)      (0, 2.1)       (-.1, 2.6)     (-.3, 2.3)     (-.4, 3.4)
  Satterthwaite   (.3, 2.6)      (.8, 5.5)†     (.6, 3.6)      (-.2, 0)*†     (0, 0)†        (.3, 2.3)

PI                52.2%          52.4%          51.4%          52.2%          51.5%          52.2%
  Empirical       (45.5, 60.2)   (41.0, 65.4)   (41.6, 62.7)   (34.9, 71.7)   (37.1, 67.9)
  Satterthwaite   (45.0, 62.0)   (49.1, 67.5)   (4?.1, 52.9)   (39.5, 56.2)   (25.5, 32.3)†  (50.1, 55.9)

PR                .08%           0.1%           0.1%           0%             0%             0.1%
  Empirical       (-.4, .4)      (-.4, .7)      (-.3, .6)      (-.8, .8)      (-.6, 1.1)
  Satterthwaite   (0, .3)        (-.5, 0)*†     (0, 1.2)       (-2.3, -.2)*†  (0, 3)         (0, .3)

IR                2.4%           2.4%           2.5%           2.3%           2.7%           2.4%
  Empirical       (1.5, 3.5)     (.9, 4.3)      (.8, 4.5)      (.6, 5.6)      (.5, 6.2)
  Satterthwaite   (1.4, 4.3)     (1.8, 6.9)     (1.5, 7.5)     (.5, 3.9)      (.3, 2.3)†     (1.4, 3.5)

Residual          18.1%          17.3%          18.4%          16.9%          18.0%          18.0%
  Empirical       (15.7, 20.7)   (14.8, 19.6)   (14.3, 23.3)   (12.9, 21.5)   (13.7, 22.6)
  Satterthwaite   (16.7, 19.9)   (14.8, 17.7)†  (17.1, 21.0)   (12.7, 15.2)†  (13.3, 16.4)†  (17.7, 18.8)

*Due to negative estimates of this variance component, the confidence interval computed using Satterthwaite’s technique is not appropriate and should not be interpreted. 
†Computed confidence intervals marked with a dagger (italicized in the printed table) do not contain the variance percentage found with the full sample. 



selected at random, and computations were based on that single 
sample. 
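
The data-collection conditions described above can be sketched in a few lines. The code below lists the designs that use exactly 1,600 observations within the limits of the full sample, and draws one subsample for a chosen design. The article does not describe the sampling mechanism, so simple random sampling without replacement on each facet is an assumption, and all names are hypothetical.

```python
import random

# Facet sizes of the full sample as reported in the Data paragraph.
FULL = {"persons": 200, "items": 16, "raters": 4}
TARGET_OBSERVATIONS = 1600


def candidate_designs(total=TARGET_OBSERVATIONS, full=FULL):
    """Return all (n_persons, n_items, n_raters) with n_p * n_i * n_r == total."""
    designs = []
    for n_i in range(1, full["items"] + 1):
        for n_r in range(1, full["raters"] + 1):
            if total % (n_i * n_r) == 0:
                n_p = total // (n_i * n_r)
                if n_p <= full["persons"]:
                    designs.append((n_p, n_i, n_r))
    return sorted(designs)


def draw_sample(design, rng, full=FULL):
    """Draw index sets for one replication (assumed: sampling without replacement)."""
    n_p, n_i, n_r = design
    persons = sorted(rng.sample(range(full["persons"]), n_p))
    items = sorted(rng.sample(range(full["items"]), n_i))
    raters = sorted(rng.sample(range(full["raters"]), n_r))
    return persons, items, raters


if __name__ == "__main__":
    print(candidate_designs())
    print(draw_sample((50, 8, 4), random.Random(0)))
```

The five conditions studied appear in the enumerated list alongside other feasible designs, which illustrates why the authors restricted attention to a representative subset.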

Analysis. For each of the 500 samples, and for the full data set, 
a person × item × rater (p × i × r) G study was performed, and 
variance components were estimated using GENOVA. The 100 
replications of each sampling condition provided an empirical sam- 
pling distribution for each of the variance components and allowed 
empirical estimation of means, standard deviations, and 95% con- 
fidence intervals for each variance component. The percentage of 
variance due to each variance component was also calculated, along 
with the appropriate 95% confidence intervals for these percent- 
ages. These empirical confidence intervals were compared with the 
confidence intervals obtained using Satterthwaite’s technique. 
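
As a rough illustration of what a p × i × r G study computes, the sketch below estimates the seven variance components from the ANOVA expected mean squares for a fully crossed design with one rating per cell. It is a plain-Python stand-in written for this illustration, not the GENOVA program used in the study; the function names and the nested-list data layout are assumptions.

```python
from itertools import product


def variance_components(x):
    """Estimate variance components for a fully crossed p x i x r design.

    x[p][i][r] is the rating of person p on item i by rater r. Each facet
    needs at least two levels. Estimates can be negative.
    """
    n_p, n_i, n_r = len(x), len(x[0]), len(x[0][0])
    ps, its, rs = range(n_p), range(n_i), range(n_r)
    cells = list(product(ps, its, rs))
    grand = sum(x[p][i][r] for p, i, r in cells) / len(cells)

    # Marginal means for main effects and two-way combinations.
    m_p = [sum(x[p][i][r] for i in its for r in rs) / (n_i * n_r) for p in ps]
    m_i = [sum(x[p][i][r] for p in ps for r in rs) / (n_p * n_r) for i in its]
    m_r = [sum(x[p][i][r] for p in ps for i in its) / (n_p * n_i) for r in rs]
    m_pi = [[sum(x[p][i][r] for r in rs) / n_r for i in its] for p in ps]
    m_pr = [[sum(x[p][i][r] for i in its) / n_i for r in rs] for p in ps]
    m_ir = [[sum(x[p][i][r] for p in ps) / n_p for r in rs] for i in its]

    # Sums of squares; the three-way term is obtained by subtraction.
    ss = {
        "p": n_i * n_r * sum((m - grand) ** 2 for m in m_p),
        "i": n_p * n_r * sum((m - grand) ** 2 for m in m_i),
        "r": n_p * n_i * sum((m - grand) ** 2 for m in m_r),
        "pi": n_r * sum((m_pi[p][i] - m_p[p] - m_i[i] + grand) ** 2
                        for p in ps for i in its),
        "pr": n_i * sum((m_pr[p][r] - m_p[p] - m_r[r] + grand) ** 2
                        for p in ps for r in rs),
        "ir": n_p * sum((m_ir[i][r] - m_i[i] - m_r[r] + grand) ** 2
                        for i in its for r in rs),
    }
    total = sum((x[p][i][r] - grand) ** 2 for p, i, r in cells)
    ss["pir"] = total - sum(ss.values())

    df = {"p": n_p - 1, "i": n_i - 1, "r": n_r - 1}
    df.update(pi=df["p"] * df["i"], pr=df["p"] * df["r"], ir=df["i"] * df["r"])
    df["pir"] = df["p"] * df["i"] * df["r"]
    ms = {k: ss[k] / df[k] for k in ss}

    # Solve the expected-mean-square equations for the components.
    return {
        "p": (ms["p"] - ms["pi"] - ms["pr"] + ms["pir"]) / (n_i * n_r),
        "i": (ms["i"] - ms["pi"] - ms["ir"] + ms["pir"]) / (n_p * n_r),
        "r": (ms["r"] - ms["pr"] - ms["ir"] + ms["pir"]) / (n_p * n_i),
        "pi": (ms["pi"] - ms["pir"]) / n_r,
        "pr": (ms["pr"] - ms["pir"]) / n_i,
        "ir": (ms["ir"] - ms["pir"]) / n_p,
        "pir,e": ms["pir"],
    }


def as_percentages(components):
    """Express each component as a percentage of the summed components."""
    total = sum(components.values())
    return {k: 100.0 * v / total for k, v in components.items()}
```

Because the estimators are differences of mean squares, small components such as those involving raters can come out negative in a particular sample, which is the situation flagged in the footnote to Table 1.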

To assess the practical implications of the differences in the var- 
iance components, a series of D studies was conducted. Because the 
results of the G studies suggested that only a small percentage of 
the variance was associated with the rater facet, the number of 
raters was fixed at four for all D studies, while the number of items 
varied from one to 30. Two sets of D studies were conducted for 
each of the five simulated conditions. This was done in order to 
illustrate how results could differ even under the same data-collec- 
tion design. The specific samples were chosen so that the person 
variance component was at the 10th and 90th percentiles of the 
distribution for that condition. A D study was also conducted on 
the full data set. 

Results 

The results of the G studies using the full data set and the five 
different conditions are summarized in Table 1. For conditions 1 



through 5, the percentages associated with each variance compo- 
nent represent the averages across the 100 replications. The con- 
fidence intervals reported here are based on the empirical distri- 
butions of these percentages in the 100 replications. The 
confidence intervals obtained using Satterthwaite’s technique are 
reported below the empirical confidence intervals. 
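
One common way to build an empirical 95% interval from replicated estimates is to take the 2.5th and 97.5th percentiles of the replication distribution. The sketch below does this with linear interpolation. The article does not state whether a percentile method or a mean plus or minus 1.96 standard deviations was used, so the percentile approach here is an assumption, as are the function names.

```python
def percentile(sorted_values, q):
    """Percentile q (0 to 100) of pre-sorted data, using linear interpolation."""
    if not sorted_values:
        raise ValueError("need at least one value")
    position = (len(sorted_values) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1.0 - fraction) + sorted_values[upper] * fraction


def empirical_interval(estimates, level=95.0):
    """Central interval covering `level` percent of the replicated estimates."""
    ordered = sorted(estimates)
    tail = (100.0 - level) / 2.0
    return percentile(ordered, tail), percentile(ordered, 100.0 - tail)
```

Applied to the 100 replicated percentages for one variance component under one condition, a function of this kind would return a lower and an upper limit comparable in form to the empirical rows of Table 1.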

Comparing the average percentage of variance associated with 
each of the facets across the five sampled conditions, it appears that 
differences between conditions are minimal. The average percent- 
ages are also very similar to the variance-components estimates ob- 
tained using the full data set. However, because the results for the 
various sampling conditions were based on the variance compo- 
nents averaged across 100 samples, it is important to consider the 
associated confidence intervals, which indicate the variability in 
the sampling distributions. A review of the empirical confidence 
intervals suggests differences in the stability of the estimates ob- 
tained under various conditions. For example, the widths of the 
empirical confidence intervals for the item component range from 
about 9% (condition 3) to 36% (condition 4), suggesting that con- 
dition 3 provides a more stable estimate of the item-variance com- 
ponent. Considering all five sampling conditions, condition 1 pro- 
vides the most stable estimates of four of the seven variance 
components. By contrast, condition 4 provides the least stable es- 
timates of five of the seven components. 

The computed confidence intervals show considerable variability 
across conditions in the widths of the intervals and the values of 
the lower and upper limits. Sixteen of the 35 computed confidence 
intervals for conditions 1 through 5 were wider than the empirical 
intervals; the remaining 19 were not. Twelve of the 35 computed 
confidence intervals did not contain the value of the percentage of 





Table 2. Estimates of G Coefficients from D Studies Based on Variance Components from the 10th and 90th Percentiles of the Five Conditions and the 
Full Data (No. of Raters = 4), University of Massachusetts Medical School, 1999-2000 

                    Condition 1   Condition 2   Condition 3   Condition 4   Condition 5   Full Sample
No. of examinees:   25            50            50            100           100           200
No. of items:       16            8             16            4             8             16
No. of raters:      4             4             2             4             2             4

No. of items        10th % 90th % 10th % 90th % 10th % 90th % 10th % 90th % 10th % 90th %
1                   .14    .25    .15    .28    .17    .24    .16    .25    .14    .26    .21
2                   .25    .40    .27    .43    .30    .39    .27    .40    .24    .41    .34
3                   .33    .50    .35    .54    .39    .49    .36    .50    .33    .51    .44
4                   .40    .57    .42    .61    .46    .56    .43    .58    .39    .58    .51
5                   .46    .63    .48    .66    .51    .62    .49    .63    .45    .63    .57
6                   .50    .67    .52    .70    .56    .66    .53    .67    .49    .68    .61
7                   .54    .70    .56    .73    .60    .69    .57    .70    .53    .71    .65
8                   .57    .73    .59    .75    .63    .72    .61    .73    .56    .73    .68
9                   .60    .75    .62    .77    .66    .74    .64    .75    .59    .76    .70
10                  .63    .77    .64    .79    .68    .76    .66    .77    .62    .78    .72
11                  .65    .79    .66    .81*   .70    .78    .68    .79    .64    .79    .74
12                  .67    .81*   .68    .82    .72    .80*   .70    .80*   .66    .80*   .76
13                  .69    .81    .70    .83    .73    .81    .72    .81    .68    .82    .77
14                  .70    .82    .72    .84    .75    .82    .74    .83    .69    .83    .79
15                  .72    .83    .73    .85    .76    .83    .75    .83    .71    .84    .80*
16                  .73    .84    .74    .86    .77    .84    .76    .84    .72    .84    .81
17                  .74    .85    .75    .87    .78    .85    .78    .85    .73    .85    .82
18                  .75    .86    .76    .87    .79    .85    .79    .86    .75    .86    .82
19                  .76    .86    .77    .88    .80*   .86    .80*   .86    .76    .87    .83
20                  .77    .87    .78    .88    .81    .87    .81    .87    .77    .87    .84
21                  .78    .88    .79    .89    .82    .87    .81    .88    .77    .88    .85
22                  .79    .88    .80*   .89    .82    .88    .82    .88    .78    .88    .85
23                  .80*   .89    .80    .90    .83    .88    .83    .89    .79    .89    .86
24                  .80    .89    .81    .90    .83    .89    .84    .89    .80*   .89    .86
25                  .81    .89    .82    .90    .84    .89    .84    .89    .80    .89    .87

Note: In each column, an asterisk marks the number of items estimated to be needed to obtain a G coefficient of .80 (set in bold in the printed table). 



variance estimated from the full sample, and 12 did not contain 
the value of the mean percentage of variance estimated from the 
100 samples of the specified condition. 

As noted above, a series of D studies was conducted to illustrate 
how estimates of G coefficients might vary depending on the sam- 
pling design and the specific sample used in the G study. Because 
such decision studies are often used to determine a minimal number 
of items to be administered to obtain a specified G coefficient 
(much as the Spearman-Brown prophecy formula is used in clas- 
sical test theory), the number of items was varied from 1 to 25. 
These results are presented in Table 2. One result of interest is the 
number of items estimated to be needed to obtain a G coefficient 
of .80. This value is in bold in each column. 
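
The D-study computation behind Table 2 can be sketched from the full-sample percentages in Table 1. The code below assumes the relative G coefficient for a crossed p × i × r design with raters fixed at four; because the coefficient is a ratio, percentages of variance can stand in for the variance components themselves. The inputs are rounded published values, so agreement with Table 2 is approximate, and the function names are hypothetical.

```python
# Full-sample percentages of variance from Table 1 (components involving persons).
FULL_SAMPLE_PCT = {"p": 14.9, "pi": 52.2, "pr": 0.1, "pir,e": 18.0}


def g_coefficient(vc, n_items, n_raters=4):
    """Relative G coefficient for n_items items and n_raters raters."""
    relative_error = (
        vc["pi"] / n_items
        + vc["pr"] / n_raters
        + vc["pir,e"] / (n_items * n_raters)
    )
    return vc["p"] / (vc["p"] + relative_error)


def items_needed(vc, target=0.80, n_raters=4, max_items=25):
    """Smallest number of items whose G, rounded to two decimals, reaches target."""
    for n_items in range(1, max_items + 1):
        if round(g_coefficient(vc, n_items, n_raters), 2) >= target:
            return n_items
    return None


if __name__ == "__main__":
    for n in (1, 5, 10, 15, 25):
        print(n, round(g_coefficient(FULL_SAMPLE_PCT, n), 2))
    print("items needed:", items_needed(FULL_SAMPLE_PCT))
```

Rounding to two decimals before comparing with the target mirrors how the tabled values are read; with unrounded coefficients the threshold can fall one item later.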

Considering the 90th percentile samples for all five conditions, it 
can be seen that for four of the five conditions the estimate of the 
number of items needed to achieve a G of .80 is 12; for the second 
condition the estimate is 11. The estimate based on the full sample 
is 15 items. Considering the 10th percentile samples for the five 
sampling conditions, the estimated numbers of items needed range 
from 19 to 24. Comparing the 10th and 90th percentile samples 
within conditions, substantial differences in estimates are apparent. 
For instance, for condition 5 (Np = 100, Ni = 8, Nr = 2) the number 
of items estimated to be necessary at the 10th percentile is 24, versus 
only 12 items if the sample at the 90th percentile is used. 

Discussion and Implications 

The results presented above suggest that, at least for the data set 
used here, different data-collection designs would have had little 



impact on average on the variance-components estimates obtained. 
In other words, collecting ratings of 25 examinees on 16 items using 
four raters would have resulted in approximately the same variance- 
components estimates as collecting ratings of 100 examinees on 
eight items using two raters. In all the conditions studied here, 
including the full sample, it was clear that a far higher percentage 
of the variability in scores was related to the item facet, and the 
associated interactions, compared with that associated with the 
rater facet and associated interactions. Thus, conclusions as to 
the relative impacts of item and rater would have been similar 
regardless of which data-collection design was used. 

While the average percentages of variance associated with the 
individual facets were very similar across the five conditions, the 
widths of the associated confidence intervals (both empirical and 
computed) did vary across conditions. In addition, for all of the 
five conditions, estimates of the numbers of items needed to obtain 
a given level of generalizability varied considerably depending on 
whether the 10th percentile or the 90th percentile sample was used. 
Since in practice investigators have only one sample, and no 
knowledge of where their sample falls in the distribution, it is im- 
portant to be aware of the fact that substantially different estimates 
might have resulted if a different sample had been used. The results 
of the D studies reported here highlight this. Depending on the 
condition, the numbers of items required to obtain a G of .80 
differed by as much as 100% depending on the specific sample used 
in the G study. This is particularly important given the time and 
expense associated with most performance assessments. In this case, 
analysis of the full data set suggests that 15 items would be needed 





to achieve a G coefficient of .80. While this is also, of course, a 
sample, it can be considered our best estimate of the true number 
of items needed. If, instead of the full data set, we had had only 
one of the samples investigated here, we might have come to the 
conclusion that a test of only 11 or 12 items would result in a G 
coefficient of .80. Were we then to administer a 12-item test based 
on these results, it is likely that the results would be less general- 
izable than expected, a result with potentially serious consequences 
for test developers and users. In contrast, using a different sample 
we might conclude that 24 items were needed to obtain sufficiently 
generalizable results. While this overestimation would not be a 
problem from a psychometric perspective, the costs associated with 
administering more items than are in fact needed to achieve a 
specified G could be. In fact, in some circumstances, an overesti- 
mation error could result in a decision that a particular testing 
format is not feasible given cost and time constraints. 

It is difficult to interpret the results found for the computed con- 
fidence intervals, particularly since these were calculated based on 
a single sample and would be expected to differ if a different sample 
were selected. In comparing the computed confidence intervals 
with the empirical confidence intervals, substantial discrepancies 
were found, especially for the components that accounted for 
higher percentages of the variance. In some cases discrepancies 
were found not only in the width of the interval but also between 
the values within the interval — at times the intervals were not 
even overlapping. These findings raise questions as to the usefulness 
of Satterthwaite’s technique in this instance. 

Conclusions 

It is important for test developers, psychometricians, and test users 
to remember that generalizability coefficients and other reliability 



coefficients are estimates based on samples, and as such may be 
expected to vary depending on the specific sample used in esti- 
mation. The results of the study reported here highlight how dif- 
ferent samples may produce different results, which can in turn lead 
to very different decisions. The fact that the computed confidence 
intervals differed substantially from the empirical confidence inter- 
vals, and in some cases did not even contain the appropriate per- 
centage, suggests that computing confidence intervals from a single 
sample will not necessarily improve decision making. This study 
does not allow us to make specific recommendations regarding the 
number or distribution of data points required when conducting a 
G study. Which design provides the most stable estimates will de- 
pend on the nature of the data collected. It is ideal, naturally, to 
obtain the largest sample possible; but when smaller samples are 
used, it is crucial that the stability of the estimate be taken into 
consideration before decisions are made based on a specific sample, 
as was illustrated in this study. 
A Validity Study of the Writing Sample Section of the Medical College Admission Test 

The current version of the Medical College Admission Test 
(MCAT), introduced in 1991, includes four sections: Biological 
Sciences, Physical Sciences, Verbal Reasoning, and Writing Sam- 
ple. The Writing Sample assesses skills in organizing thoughts and 
presenting ideas in a cohesive manner, and provides evidence of 
analytic thinking and writing skills. Scoring is based on two 30- 
minute essays about general topics. An example of an essay prompt 
is "In a free society, individuals must be allowed to do as they 
choose." 

Each essay is holistically scored by two trained reviewers on a 
six-point scale with regard to specific criteria such as developing 
the central idea, synthesizing concepts logically, and writing clearly 
with good grammar, syntax, and punctuation. Essays receiving 
scores that differ by more than one point are re-evaluated by a 
third expert reviewer. The scores for the two essays completed by 
each examinee are summed and converted to an 11-point alpha- 
betical scale ranging from J to T. According to reports by the As- 
sociation of American Medical Colleges (AAMC), 98% of the es- 
says are given identical scores or scores within one scale point of 
each other by the independent reviewers. 

The results of multi-institutional studies, conducted by the 
MCAT Validity Study Advisory Group, have been published and 
presented at professional meetings. However, while the need for 
additional studies of the psychometric properties of the MCAT con- 
tinues, there is a particular need for study of the predictive power 
of the Writing Sample. The unique alphabetic scores of the Writing 
Sample discourage the usual correlational analyses used in validity 
studies. Although it is possible to convert the alphabetic scores to 
the integers from 1 to 11 by assuming that the letters constitute an 
interval scale, such an assumption might not be widely accepted. 

We designed the present study to examine the validity of the 
Writing Sample section of the MCAT for students at Jefferson 
Medical College in Philadelphia, Pennsylvania. We speculated that 
the ability to organize and express ideas effectively in writing could 
have relevance to the analytic and problem-solving skills demanded 
in clinical performance. Furthermore, such skills might also be re- 
lated to a better presentation of one’s self, and to effective verbal 
expression of ideas, both of which are critical in promoting inter- 
personal relationships. Therefore, we hypothesized that scores on 
the Writing Sample would be associated more closely with indi- 
cators of clinical competence than with measures of achievement 
in basic sciences. 

Method 

Data for 1,776 matriculants (1,086 men, 690 women) at Jefferson 
Medical College between 1992 and 1999 were retrieved from the 
database of the Jefferson Longitudinal Study of Medical Education.* 
The students were classified into three groups (top, middle, and 
bottom) based on their scores on the Writing Sample. The “top” 
group included 314 (18% of the sample) who scored R, S, or T. 
The “middle” group consisted of 1,155 (65%) who scored N, O, P, 
or Q. The 307 (17%) students who scored J, K, L, or M comprised 
the “bottom” group. 
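
The letter-to-integer conversion and the three-group classification described above can be written compactly. This is an illustrative sketch only; the names are hypothetical, and the integer mapping embodies the interval-scale assumption that the authors note might not be widely accepted.

```python
# The 11-point alphabetic Writing Sample scale, lowest (J) to highest (T).
LETTERS = "JKLMNOPQRST"

# Integer coding 1..11 (treats the letters as an interval scale, an assumption).
SCORE_TO_INT = {letter: rank for rank, letter in enumerate(LETTERS, start=1)}


def writing_group(letter: str) -> str:
    """Classify a Writing Sample letter score as top, middle, or bottom."""
    letter = letter.upper()
    if letter not in SCORE_TO_INT:
        raise ValueError(f"not a Writing Sample score: {letter!r}")
    if letter in ("R", "S", "T"):
        return "top"
    if letter in ("N", "O", "P", "Q"):
        return "middle"
    return "bottom"  # J, K, L, or M
```

Grouping by letter avoids the interval-scale assumption altogether, which is why the analyses that follow compare group means rather than correlate integer-coded scores.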

Three sets of criteria were used. 




■ Admission measures. The first set included the measures typically 
used for screening applicants, such as undergraduate grade-point 
averages (UGPAs) in science and non-science courses, admission 
interview scores, and MCAT scores on Biological Sciences, Phys- 
ical Sciences, and Verbal Reasoning. 

■ Performance in the basic sciences. The second set consisted of 
achievement measures in the basic science disciplines, including 
grade-point averages (GPAs) in first- and second-year medical 
school courses. Scores on Step 1 of the United States Medical 
Licensing Examinations (USMLE) were also included. 

■ Performance in clinical sciences and ratings of clinical competence. 
Included in this set were scores on written examinations in six 
core clerkships (family medicine, internal medicine, obstetrics- 
gynecology, pediatrics, psychiatry, and surgery) in the third year 
of medical school. Written examinations in basic and clinical 
sciences are in either multiple-choice or uncued formats, with 
reliability estimates usually over r = .75. 

Combined global ratings of clinical competence in the six core 
clerkships, on a 100-point scale, and scores on Step 2 of the 
USMLE were also included. In addition, medical school class rank 
(percentile), a composite measure with two thirds weight for clin- 
ical competence in the core clerkships and one third weight for 
the combined first- and second-year GPAs, was used, as were the 
ratings of graduates’ clinical competence from a 33-item rating form 
measuring three clinical competence areas of "data-gathering and 
processing skills" (16 items), "interpersonal skills and attitudes” 
(ten items), and "socioeconomic aspects of patient care" (seven 
items). These ratings were made on a four-point Likert scale by 
program directors near the end of the first postgraduate year. Data 
have been reported in support of the measurement properties of 
this rating form, including construct validity (factor structure), the 
internal consistency aspect of reliability, and criterion-related va- 
lidity. 

Continuous measures were transformed to a distribution with a 
mean of 100 and a standard deviation of 10 to facilitate compari- 
sons of the magnitudes of differences on a scale with a uniform 
mean and standard deviation. This transformation was used to mit- 
igate the issue of scale incompatibility within each class and be- 
tween classes. The numbers of observations vary for different anal- 
yses because data were not yet available for the entire sample at 
the time of this study. 
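
The rescaling described above is a linear transformation of standardized scores. A minimal sketch follows, assuming it is applied separately to each measure within each entering class; whether the authors used the population or the sample standard deviation is not stated, so the use of `pstdev` is an assumption, as is the function name.

```python
from statistics import mean, pstdev


def rescale(scores, new_mean=100.0, new_sd=10.0):
    """Linearly transform scores to the given mean and standard deviation."""
    m = mean(scores)
    s = pstdev(scores)  # population SD (assumed; the article does not specify)
    if s == 0:
        raise ValueError("scores have zero variance and cannot be rescaled")
    return [new_mean + new_sd * (x - m) / s for x in scores]
```

After rescaling, a group mean of 103 on any measure can be read as three tenths of a standard deviation above the class average, which is what makes the rows of the results table comparable.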

The three groups were compared with respect to the criterion 
measures by using analysis of variance for continuous measures, fol- 
lowed by the Duncan test and the Kruskal-Wallis test for class 
rank. Analysis of covariance was also employed to make statistical 
adjustments for baseline differences in the scores of other MCAT 
sections. 
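
For readers unfamiliar with the F-ratios reported in the results, a one-way analysis of variance comparing group means can be sketched as follows. This is a generic textbook computation with a hypothetical function name, not the authors' software, and it omits the p value, the post hoc tests, and the covariance adjustment.

```python
def one_way_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA on lists of scores."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Between-groups and within-groups sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between, df_within = k - 1, n - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within
```

With three groups, the between-groups degrees of freedom are two, and the within-groups degrees of freedom equal the effective sample size minus three.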

Results 

Admission Variables. The means and sample sizes for the criterion 
measures and a summary of the statistical analyses are presented in 
Table 1. Comparisons of the top, middle, and bottom groups on 
the Writing Sample showed no significant difference for under- 
graduate science GPA, or for the Biological and Physical Sciences 
Table 1. Means of Selected Admission Measures and Performances in Basic and Clinical Sciences by Scores on the Writing Sample Section of the 

Medical College Admission Test (MCAT)* 

                                                          Mean Criterion Measures for Three Groups
                                                          Classified by Level of Writing Sample Scores†
Criterion Measure‡                                        Top       Middle    Bottom    Effective F-ratio   p
                                                          (n=314)   (n=1,155) (n=307)   n§

Admission
  Undergraduate GPAs: science                             100.2     100.0     99.7      1,766     .21       .81
  Undergraduate GPAs: non-science                         101.3     99.9      99.1      1,766     4.3       .02
  Admission interview                                     100.2     100.0     99.6      1,769     .30       .74
  MCAT: Biological Sciences                               100.2     99.6      99.0      1,776     1.3       .28
  MCAT: Physical Sciences                                 100.2     99.4      99.1      1,776     1.2       .28
  MCAT: Verbal Reasoning                                  103.1     99.5      96.8      1,776     31.9      <.01
Basic sciences
  Medical school: 1st- & 2nd-year GPAs                    100.9     100.0     99.2      1,535     1.8       .17
  USMLE: Step 1                                           100.8     99.9      99.7      1,271     .97       .38
Clinical sciences and ratings of clinical competence
  Medical school: 3rd year (objective tests)              101.9     100.0     98.0      1,036     6.0       <.01
  Medical school: 3rd year (clinical ratings)             101.9     100.0     98.0      1,036     5.9       <.01
  Medical school class rank                               86.0      85.5      84.8      1,036     6.1       <.01
  USMLE: Step 2                                           100.6     99.9      98.7      1,006     3.1       .04
  Postgraduate ratings: data gathering                    103.1     100.0     97.9      433       2.7       .07
  Postgraduate ratings: interpersonal & attitudes         102.8     100.0     97.1      433       3.2       .04
  Postgraduate ratings: socioeconomics of patient care    102.8     100.0     97.7      433       2.5       .08
  Postgraduate ratings: physician as a clinician          103.2     99.6      98.5      423       2.4       .09
  Postgraduate ratings: physician as an educator          103.4     99.0      97.8      364       2.8       .06
  Postgraduate ratings: physician as a manager            101.5     99.5      98.4      339       .71       .49

*Participants were 1,776 students who entered Jefferson Medical College between 1992 and 1999. 
†“Top” category includes R, S, T; “middle” category includes N, O, P, Q; and “bottom” category includes J, K, L, or M alphabetic scores. 
‡With the exception of medical school class rank, all other criterion measures for each entering class were transformed to a distribution with a uniform mean of 100 and a standard 
deviation of 10 to facilitate comparisons of mean differences. 
§Numbers of observations vary due to unavailability of data at the time of the study. 



sections of the MCAT. However, significant differences were ob- 
served for undergraduate non-science GPA (p < .05), and the Ver- 
bal Reasoning test (p < .01). Duncan tests indicated that the top 
group’s undergraduate non-science GPA was significantly higher 
than those of the middle and bottom groups (p < .05). As expected, 
the top group also obtained the highest mean score in Verbal Rea- 
soning, followed by the middle and bottom groups (p < .01). 

Performances in Basic Sciences Disciplines in Medical School. Data 
reported in Table 1 indicate that although the top group consis- 
tently outperformed the bottom group in first- and second-year ba- 
sic science courses, as well as on USMLE Step 1, the differences 
were not statistically significant. 

Performances in Clinical Science Disciplines and Ratings of Clinical 
Competence. Statistically significant differences were observed 
among the top, middle, and bottom groups on a number of perfor- 
mance measures in clinical disciplines. Both the top and the middle 
groups obtained significantly higher mean grades (p < .01) than 
did the bottom group on written examinations in the six core clerk- 
ships. A similar pattern of findings was observed for medical school 
class rank. 

The top group was also rated significantly higher than the middle 
and bottom groups in global ratings of clinical competence in the 
third-year core clerkships (p < .01). The difference between the 
top and bottom groups’ Step 2 scores was also statistically signifi- 
cant (p < .05). 

Results for the six measures of clinical competence in residency 
showed that the differences for ratings in interpersonal skills and 
attitudes were statistically significant ( p < .05), where the top group 
was rated significantly higher than the bottom group. Although the 
differences in other areas of postgraduate competence did not reach 
the conventional level of statistical significance (p < .05), a con- 
sistent pattern was observed in which the highest average ratings 



were obtained by the top group, and the lowest by the bottom 
group. 

In additional analyses, the two extreme groups (top and bottom) 
were compared regarding the ratings in other areas of clinical com- 
petence in residency, and standardized effect-size estimates (d) were 
calculated for the significant pairwise differences. The top group 
was rated higher than the bottom group in data-gathering and pro- 
cessing skills (p < .05, estimated effect size = .52), socioeconomic 
aspects of patient care (p < .05, effect size = .51), and physician 
as a patient educator (p < .05, effect size = .56). Effect-size esti- 
mates of this order of magnitude are not small according to Cohen’s 
definition. These differences are not only statistically significant, 
but also of practical significance. 
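
As an illustration of the effect-size estimate d mentioned above, the following minimal sketch computes a standardized mean difference with a pooled standard deviation and labels it using Cohen's conventional benchmarks (0.2 small, 0.5 medium, 0.8 large). The function names and the example ratings are hypothetical and are not taken from the study; only the three reported values (.52, .51, .56) come from the text.

```python
import math


def cohens_d(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)


def cohen_label(d):
    """Label an effect size with Cohen's conventional benchmarks."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Hypothetical ratings (NOT study data): top group vs. bottom group.
d_example = cohens_d(mean_a=3.52, mean_b=3.20, sd_a=0.6, sd_b=0.6, n_a=50, n_b=50)
print(round(d_example, 2), cohen_label(d_example))

# Effect sizes reported in the text for the three residency ratings.
for reported in (0.52, 0.51, 0.56):
    print(reported, cohen_label(reported))
```

Under these benchmarks, each of the three reported values falls in the medium range, which is consistent with the statement that they are not small.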

Controlling for Differences on the Other Sections of the MCAT.
Statistical adjustments were made for baseline differences using 
both the Biological Sciences and the Physical Sciences sections of 
the MCAT as covariates through analysis of covariance. Each of 
the previously-reported differences among the three groups re- 
mained unchanged. This confirms that the previous findings were 
not confounded by score differences in these two sections of the 
MCAT. 

Further statistical adjustments were made by adding scores on 
the Verbal Reasoning section of the MCAT to the other two covariates
(scores on the Biological and Physical Sciences sections).
The differences remained unchanged for the following criterion 
measures: clinical clerkship examinations (adjusted p = .02), clin- 
ical clerkship ratings (adjusted p = .02), and medical school class 
rank (adjusted p = .008). However, changes in statistical significance
levels were observed in the undergraduate non-science GPAs
(adjusted p = .10), Step 2 scores (adjusted p = .31), and postgraduate
ratings of data-gathering and data-processing skills (adjusted
p = .11).

Discussion 

The findings of the present study confirm the research hypothesis 
that scores on the Writing Section of the MCAT yield a closer 
association with measures of clinical competence than with 
achievement in the basic sciences. 

These findings provide support for the validity of the Writing 
Sample from a number of perspectives. We hypothesized that high 
scorers on the Writing Sample would outperform others in clinical 
sciences evaluations and in ratings of clinical competence. The 
hypothesis was confirmed, providing support for the predictive va- 
lidity of the test. 

The fact that scores on the Writing Sample were significantly
associated with performance in the clinical areas in medical school
and residency provides evidence in support of convergent validity,
whereas their lack of associations with measures of achievement in
science prior to and during medical school supports the discriminant
validity of the test. In addition, concurrent validity was demonstrated
by the relationships between the Writing Sample and
Verbal Reasoning scores.

Clinical grades in medical school are based on the observations
of faculty and supervising residents during the actual provision of
clinical care to patients, and reflect the ability of students to relate
well to others. These dimensions of clinical competence require
basic medical knowledge, which may be predicted on the basis of
MCAT science scores. However, while necessary, medical knowledge
is not sufficient for effective clinical decision making. The
significant relationship between the Writing Sample scores and
clinical ratings after adjustment for MCAT science scores confirms
that the associations between Writing Sample scores and measures
of clinical performance are beyond those that would be expected
from attainment of knowledge only. Therefore, it can be concluded
that the Writing Sample measures a unique skill, different from
those measured by the other sections of the MCAT, including the
Verbal Reasoning section. It can be speculated that such a unique
skill might be attributed more to factors that are not associated
with achievement in sciences. Such speculation needs to be verified
further by empirical evidence.

The results generally suggest that, for a sample of students at one 
medical school, Writing Sample scores of J, K, L, or M predicted
poorer clinical performance during and after medical school. This 
particular grouping of the Writing Sample scores should be studied 
further in samples from other medical schools before implementa- 
tion in decision making. 

Certain aspects of this study could be questioned and deserve
comment. It may be argued that the statistically significant findings
of this study could have been due to chance as a result of the large
number of statistical comparisons that were performed. However,
this argument can be refuted based on the findings for the 18 criterion
measures reported in Table 1. While only one statistically
significant finding would be expected by chance alone at p < .05,
seven were reported in this table. Similarly, the internal validity of
the findings could be questioned by arguing that the statistically
significant findings could be attributed to the large sample size,
rather than underlying relationships among the variables. This argument
can also be refuted based on the findings that the significant
associations were observed only for the conceptually relevant
scores, such as Verbal Reasoning, whereas there was no relationship
with the less relevant scores such as the Biological and Physical
Sciences, despite the fact that the sample size was equally large (n
= 1,776) in all analyses. Furthermore, the magnitudes of the effect-size
estimates between top and bottom scorers suggest that the obtained
differences are of practical importance to decision makers.
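
The chance argument above can be checked with a short calculation. The sketch below computes the expected number of false positives for 18 tests at the .05 level and the binomial probability of seeing seven or more significant results by chance alone. It assumes the 18 comparisons are independent, which is a simplification introduced here (the criterion measures in the study are likely correlated); the function name is hypothetical.

```python
from math import comb


def prob_at_least(k, n, alpha):
    """P(X >= k) for X ~ Binomial(n, alpha)."""
    return sum(
        comb(n, i) * alpha ** i * (1 - alpha) ** (n - i)
        for i in range(k, n + 1)
    )


n_tests = 18   # criterion measures reported in Table 1
alpha = 0.05   # conventional significance level

# Expected number of significant findings under the null: 18 * .05 = 0.9,
# i.e., about one, as stated in the text.
expected_false_positives = n_tests * alpha

# Probability of seven or more significant findings by chance alone,
# ASSUMING independent tests (an assumption made for this sketch).
p_seven_or_more = prob_at_least(7, n_tests, alpha)

print(expected_false_positives)
print(f"{p_seven_or_more:.2e}")
```

Under the independence assumption the tail probability is very small, which supports the authors' reasoning; with correlated measures the true probability would be somewhat larger.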

These findings, coupled with the relatively large sample size and 
the longitudinal design of this study, provide assurance for the in- 
ternal validity of the results. However, more data from other med- 
ical schools are needed to assure the external validity and the gen- 
eralization of the findings. 

In earlier studies we found that validity coefficients for the
MCAT varied for students who graduated from different colleges
and universities, that the validity of the MCAT varied for different
sets of scores when applicants repeated the examination, and
that different sections of the MCAT have different predictive validity
depending upon the criterion measures. Empirical evidence
also suggests that validity coefficients for the MCAT vary among
medical schools. It will be essential to consider these factors in
future studies of the validity of the MCAT.

Correspondence: Mohammadreza Hojat, PhD, Jefferson Medical College, Philadelphia,
PA 19107; e-mail: (Mohammadreza.Hojat@mail.tju.edu).
CLOSE BUT NO BANANAS:

PREDICTING PERFORMANCE 

Prediction of Students’ Performances on Licensing Examinations Using Age, Race, Sex, 
Undergraduate GPAs, and MCAT Scores 

J. JON VELOSKI, CLARA A. CALLAHAN, GANG XU, MOHAMMADREZA HOJAT, and DAVID B. NASH



The annual selection of new students is one of the most important
activities of medical school faculty. They face the challenge of selecting
those who can perform well not only in the preclinical
years, but also in the clinical arena of medical school, in graduate
medical education, and beyond. To make sound, evidence-based
decisions, faculty involved in the admission process depend on em- 
pirical studies that examine the relationship of an applicant’s aca- 
demic performance before medical school to that individual’s aca- 
demic performance during medical school and afterwards. 

Studies have consistently shown that Medical College Admission 
Test (MCAT) scores and undergraduate grade-point averages 
(GPAs) are the most important indicators of students’ future aca- 
demic performances. 

Specifically, MCAT science scores and undergraduate science
GPAs have been associated with preclinical academic performance.
However, verbal scores on the MCAT and non-science
GPAs have been more closely associated with performance in the
clinical years, such as on the United States Medical Licensing Examination
(USMLE) Steps 2 and 3. Correspondingly, the combination
of GPAs and MCAT scores has been shown to be the best
predictor of preclinical academic performance.

The predictive strength of MCAT scores and GPAs is less clear
when students’ race and sex have been considered, and when performance
has been followed longitudinally beyond the preclinical
years. Men on average have outperformed women on the USMLE
Step 1. The differences were moderated, but not eliminated, by
statistical control for differences in prematriculation measures.
Conversely, women have outperformed men on the National Board
of Medical Examiners (NBME) Part II, though the differences were
not as great as those observed between the scores of men and
women on Part I, where men outperformed women. Control for
differences in prematriculation measures and Part I performances
increased the magnitude of differences between women and men
on Part II. This phenomenon had been noted several decades earlier.
Finally, the findings related to students’ ages have been
equivocal, often because age has been confounded with sex or undergraduate
academic performance.

Studies among racial groups have revealed substantial differences
in performances on Part I. Although white students on average
have scored highest, followed by Asian Americans, Hispanics, and
African Americans, these gaps become narrower after controlling
for MCAT scores and undergraduate GPAs. One might expect
Asian Americans, who as a group have had the highest mean
MCAT scores, to outperform other racial groups during medical
school. However, two major studies across time and across medical
schools have reported lower mean performance for Asian Americans
than for white students in medical school.

In summary, previous admission-prediction studies have looked
at the predictive value of MCAT scores and GPAs for USMLE
Step 1 performances among racial groups, clerkship performance
during medical school, and a combination of Step 1 and
clerkship performances. Other studies have ignored students’
age, race, or sex when examining the correlation between
prematriculation measures and students’ performances during medical
school, or have studied characteristics such as race without
controlling for GPAs and MCAT scores.

ACADEMIC MEDICINE, VOL. 75,



We designed the present study to evaluate simultaneously the 
relative importances of MCAT scores, undergraduate GPAs, age, 
race, and sex in predicting performances on the three-step sequence 
of preclinical, clinical, and postgraduate licensing examinations. 

Method 

The sample consisted of 6,239 matriculants who entered Jefferson
Medical College during the 30 years between 1968 and 1997, inclusive.
The dependent variables were total scores on Parts I, II,
and III of the licensing examinations of the NBME and total scores
on Steps 1, 2, and 3 of the USMLE (the latter three examinations
replaced the former three several years ago). Although scores on
either of the preclinical examinations (Part I, Step 1) were available
for every individual studied, scores on the second, clinical tests
(Part II, Step 2) were available for only 5,887, because the others
had either left medical school or not yet taken the test. Scores on
the Part III and Step 3 examinations were collected prospectively
for the 3,884 graduates (62%) who had given written permission
and completed the examination at the time of the study.

A separate multivariate linear regression model was generated for
each of the six dependent variables. NBME scores were transformed
from a mean and standard deviation of 500 and 100 to the USMLE
scale of 200 and 20 to allow comparisons across the two time periods.
Repeated scores were averaged. MCAT scores in earlier time
periods were transformed to the current scale and repeated scores
averaged using methods reported previously. Scores on science
subtests were averaged to estimate an overall science score. Sex was
coded 0 for men and 1 for women, who were 26% of the entire
cohort. Students who were more than 23 years old at the time of
matriculation (also 26% of the cohort) were coded 1 and others
were coded 0. An earlier study of a portion of the cohort confirmed
that this age cut-off serves as a proxy for nontraditional students.
Racial-ethnic backgrounds, as defined by the Association of American
Medical Colleges, consisted of the Asian, Oriental, or Pacific
Islander groups (the Asian American group in this study); Hispanic
(not white); black; and white. Students in each of the first three
race categories were coded as either 1 for those in the group, or 0
for those not. The percentages for Asian American, Hispanic, and
black were 8.2%, 1.4%, and 2.8%, respectively. The other students,
who included 85.9% white and 1.7% in other racial groups with
very small sample sizes, were not coded separately.
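
The rescaling and coding steps described in this section can be sketched as follows. This is a minimal illustration, not the authors' actual procedure: the function names, dictionary keys, and category strings are hypothetical, and the rescaling is assumed to be a simple linear (z-score) transformation from the NBME scale (mean 500, SD 100) to the USMLE scale (mean 200, SD 20).

```python
def rescale_nbme(score, old_mean=500.0, old_sd=100.0, new_mean=200.0, new_sd=20.0):
    """Linearly map an NBME score onto the USMLE scale via its z-score."""
    z = (score - old_mean) / old_sd
    return new_mean + new_sd * z


def code_student(sex, age_at_matriculation, race):
    """Build the 0/1 indicator variables described in the Method section.

    White students and students in other small groups form the reference
    category, so all three race indicators are 0 for them.
    """
    return {
        "woman": 1 if sex == "woman" else 0,
        "older": 1 if age_at_matriculation > 23 else 0,  # more than 23 years old
        "asian_american": 1 if race == "Asian American" else 0,
        "hispanic": 1 if race == "Hispanic" else 0,
        "black": 1 if race == "black" else 0,
    }


# An NBME score one SD above the mean maps to one SD above the USMLE mean.
print(rescale_nbme(600))

# Hypothetical student, for illustration only.
print(code_student(sex="woman", age_at_matriculation=25, race="white"))
```

Because only three race indicators are coded, each race coefficient in the regression is interpreted relative to the uncoded reference group, which is predominantly white.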

Results 

Each of the six linear regression models shown in Table 1 was
statistically significant (F test, p < .05). The proportions of variance
explained ranged from a high for Step 1 of .26 to a low for Part
III of .15. We report only the linear regression weights for the
independent variables that were significant (t-test, p < .05). The
b-coefficients for independent variables provide information about
the absolute contribution of each variable as a predictor of the
dependent variable. The beta-coefficients, which are transformations
of the b-coefficients to a uniform scale across all independent
variables, enable comparisons of the relative importance among the
independent variables. For example, the b-coefficient of 4.26 for
Table 1. B- and Beta-coefficients, Sample Sizes, and Proportions of Variance for Regressions Predicting Performances on NBME and USMLE
Examinations from Applicant Data for Matriculants from 1968 through 1997*

                     NBME Part I    NBME Part II   NBME Part III    USMLE Step 1    USMLE Step 2    USMLE Step 3
Variable                B   Beta        B   Beta        B   Beta        B   Beta        B   Beta        B   Beta
MCAT Science         3.13   0.31     2.56   0.23     1.96   0.16     4.26   0.34     3.19   0.22     1.31   0.11
MCAT Verbal          1.32   0.14     2.38   0.21     1.73   0.14     1.49   0.13     2.76   0.21     3.07   0.28
Science GPA          7.40   0.18     7.70   0.18     6.89   0.18     9.41   0.21     9.22   0.18     4.48   0.11
Non-science GPA
Asian American      -5.52  -0.06    -7.98  -0.09    -9.80  -0.09    -7.60  -0.18   -11.47  -0.21    -7.60  -0.18
African American                    -4.00  -0.04    -8.55  -0.05
Hispanic                                            -6.60  -0.03
Woman               -3.63  -0.09                                                     3.54   0.08     3.10   0.09
Older
Constant           139.48          133.39          142.61          120.67          112.37          146.31
n*                  4,299           4,227           3,234           1,940           1,660             650
R2                   0.23            0.21            0.15            0.26            0.23            0.17

*NBME scores, which were rescaled to a mean of 200 and standard deviation of 20, were available for students who entered before 1989. USMLE scores were used thereafter.
Only b-coefficients and beta-coefficients that were significant at p < .05 by a t-test that b = 0 are reported. Blank values were not significant.



the MCAT science score in the USMLE Step 1 model indicates
that a one-point increment in a student’s MCAT science score
raises his or her predicted Step 1 score by 4.26 points. Comparison
of the beta-coefficient of .34 for the MCAT science score with the
beta-coefficient of .21 for science GPA indicates that the unique
contribution of the MCAT science score as a predictor of Step 1
is about one and a half times that of the science GPA. The validity
of these interpretations of beta-coefficients assumes equivalent variability
across independent variables, which has been documented
in other published studies of portions of this cohort.
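
To make the interpretation of the b- and beta-coefficients concrete, the sketch below applies the significant USMLE Step 1 weights as read from Table 1 (constant 120.67; MCAT science 4.26; MCAT verbal 1.49; science GPA 9.41; Asian American -7.60). The function name and the example applicant are hypothetical, and terms that were not significant in the Step 1 model are simply omitted, which is an assumption of this illustration rather than a statement of how the authors generated predictions.

```python
# Significant b-coefficients for the USMLE Step 1 model, as read from Table 1.
STEP1_B = {
    "constant": 120.67,
    "mcat_science": 4.26,
    "mcat_verbal": 1.49,
    "science_gpa": 9.41,
    "asian_american": -7.60,
}


def predict_step1(mcat_science, mcat_verbal, science_gpa, asian_american=0):
    """Predicted Step 1 score from the significant terms only (illustrative)."""
    return (
        STEP1_B["constant"]
        + STEP1_B["mcat_science"] * mcat_science
        + STEP1_B["mcat_verbal"] * mcat_verbal
        + STEP1_B["science_gpa"] * science_gpa
        + STEP1_B["asian_american"] * asian_american
    )


# Hypothetical applicant: a one-point MCAT science increment adds 4.26 points.
base = predict_step1(mcat_science=10, mcat_verbal=10, science_gpa=3.5)
plus_one = predict_step1(mcat_science=11, mcat_verbal=10, science_gpa=3.5)
print(round(base, 3), round(plus_one - base, 2))

# Relative importance: ratio of the beta-coefficients reported in the text.
beta_ratio = 0.34 / 0.21
print(round(beta_ratio, 2))
```

The ratio of the two beta-coefficients is roughly 1.6, in line with the "about one and a half times" comparison in the text.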

As would be expected, the contribution of the MCAT science
score in predicting scores on the preclinical examination was more
important than that of the science GPA. Being an older, nontraditional
student at matriculation was unrelated to all scores after
controlling for the other independent variables. The regression coefficients
for women were negative for the NBME Part I, but insignificant
for Step 1. However, being a woman was positively associated
with the scores on USMLE Steps 2 and 3. Although being
black was negatively associated with performances on Parts II and
III, and being Hispanic negatively associated with performance on
Part III, these patterns disappeared in the more recent USMLE
examinations. Overall, the only consistent pattern related to age,
race, or sex across all examinations was the negative regression
weight for Asian American students.

Discussion and Conclusion 

This longitudinal study examined the absolute and relative contributions
of MCAT scores and undergraduate GPAs, along with age,
race, and sex, in predicting students’ performances on the sequence
of three licensing examinations over the past three decades. The
analysis reflected a large number of subjects, including a small fraction
of students who reached Part I or Step 1 but did not graduate
from medical school. Although the dependent variables were limited
to licensing examinations, these are uniform measures that
apply across all medical schools.

As expected from many earlier studies, MCAT scores were consistently
more valuable than were undergraduate GPAs as predictors
of performance on licensing examinations, supporting their continued
use in selection decisions. These relationships are stable across
three decades and apply to the three examinations. Verbal scores
tended to be better indicators of performances in the clinical and
postgraduate tests. Although the non-science GPA never appeared
in the six regression models, this may be due to the high correlation
(r = 0.61) between science and non-science GPAs. There was no
independent effect for older, nontraditional students after controlling
for their undergraduate academic performances and MCAT
scores.

Earlier studies have indicated that, although underrepresented
minorities have entered medical school with significant educational
disadvantages and have continued to score lower than other students
on some measures, their clinical performances were nearly
equivalent to those of other students. In the present study, statistical
control of the baseline differences at matriculation using regression
analysis showed that underrepresented-minority students
compared with white students performed less well than would have
been predicted on the NBME in the earlier time period. However,
this pattern disappeared in the recent time period. This change
over time may have been due to the effectiveness of academic enrichment
programs. It has been reported that the gap in MCAT
scores and undergraduate GPAs between underrepresented minorities
on one side and white and Asian American students on the
other persists, supporting the need for programs aimed at enhancing
students’ academic preparation before medical school.

The most striking finding is the large negative value of the b-coefficients
as well as the beta-coefficients for Asian American students.
This indicates that, after controlling statistically for science
and verbal MCAT scores and undergraduate GPAs, these students
performed less well compared with white students. Previous studies
had revealed that Asian American students’ performances during
medical school were not as good as those of white students, without
controlling for prematriculation measures. However, the differences
between Asian American and the underrepresented-minority
groups in Step 1 performances were significantly reduced after controlling
for prematriculation measures. The findings of the present
study indicate that Asian American students’ performances fell below
expectations on all NBME and USMLE examinations, after
controlling for these prematriculation measures.

One possible explanation for the decline in performance from
the admission test to the licensing examinations may be that Asian
American families are less able to influence academic achievement
as their young adults mature. It has been reported that certain
Asian American families place greater emphasis on doing well in
school than do other groups. However, it is very important to
consider that the sample used in the present study included a heterogeneous
mix of Asian American students from families that left
diverse cultures in Asia at different points in time. Further studies
are needed to evaluate these subgroups, to investigate other measures
of academic and clinical performance, and to better understand
the factors that may influence Asian American students’ performances
in medical school and beyond.

Correspondence: Jon Veloski, MS, Center for Research in Medical Education and
Health Care, Jefferson Medical College, 1025 Walnut Street, 119, Philadelphia,
PA; e-mail: (jon.veloski@mail.tju.edu).
Does Institutional Selectivity Aid in the Prediction of Medical School Performance?

Various factors are considered in the decision to offer an admission
interview to a medical school applicant, including Medical College
Admission Test (MCAT) scores, undergraduate grade-point average
(GPA), and the selectivity of the degree-granting undergraduate
institution. Admission officers view MCAT scores, undergraduate
GPA, and institutional selectivity as having high or moderate importance.
Research has indicated that these factors, most notably
the MCAT scores and the undergraduate GPA, are reliable in helping
predict medical school performance. The strongest association
has been shown between MCAT scores and performance on
the United States Medical Licensing Examination, Step 1.

Institutional selectivity data are used to help control for differences
in grading stringency across undergraduate institutions. Previous
reports have examined the role of institutional selectivity, or
a specific undergraduate institution, as a predictor of performance
in the first two years of medical school. With the exception of
the study of Zeleznik et al., which examined ten specific undergraduate
institutions, these reports have used the Higher Education
Research Institute (HERI) Index, also called the “Astin Index,”
as a measure of institutional selectivity. Other measures of institutional
selectivity or categorization that schools of medicine may
employ include the Barron’s Profiles of American Colleges Admissions
Selector Rating and the Carnegie Classification from the
Carnegie Foundation for the Advancement of Teaching. (These
measures are explained in the next section.)

Institutional validity studies of admission decision-making data
help to determine which characteristics should be accorded highest
importance in applicant selection. Given the reliance upon institutional
selectivity as an important admission characteristic and the
different types of selectivity classifications available for medical
schools to use, the purpose of this study was to examine how well
three measures of institutional selectivity could predict medical students’
performances, specifically their performances on the USMLE
Step 1 and Step 2 and their final medical school GPAs.

Method 



Admission and medical school performance data were obtained for
the 1992-1995 matriculants at the study institution, the Medical
University of South Carolina (MUSC). Admission data for each
student consisted of his or her MCAT scores, undergraduate GPA,
undergraduate institution, three institutional selectivity categorization
indices (the 1983 HERI index, Barron’s Admissions Selector
Rating, and the Carnegie Classification), age, gender, and underrepresented
minority (URM) status. The 1983 HERI index consists
of the mean total SAT score for all students admitted in 1983 to
U.S. undergraduate institutions. The Barron’s Profile of American
Colleges Admissions Selector Rating indicates the degree of competitiveness
of admission to a college. The Carnegie Classification
includes most colleges and universities in the United States that
are degree-granting and accredited by an agency recognized by the
U.S. Secretary of Education. The Carnegie Classification is not
meant to be a measure of selectivity. It is a classification of institutions
into 19 categories based upon the ranges and types of degree-granting
programs at the institutions (doctoral through associate
of arts) and the amount of federal support annually received
by each institution.

Medical school performance data consisted of USMLE Step 1
and 2 scores and final GPA. Students admitted under the institution’s
existing Early Assurance Program (EAP) were excluded from
analysis because an MCAT score was not required for their admission.
(The EAP offered admission to exceptional applicants during
their undergraduate education based on the applicants’ SAT scores,
undergraduate GPAs, medical school admission interview ratings,
and the understanding that the applicants would not apply to another
medical school. This program stopped selecting applicants for
admission to MUSC in 1996.)

To avoid having insufficient subgroup size, we dichotomized the
Barron’s Admissions Selector Ratings and the Carnegie Classification
categories based upon logical breakpoints in the categories.
Calculated frequency distributions indicated that these breakpoints
separated into approximately equal numbers of matriculants in each
selectivity index or categorization grouping, thus confirming the
breakpoints. The Barron’s Admissions Selector Ratings were dichotomized
into “most/highly competitive” (includes Barron’s categories
“most competitive,” “highly competitive +,” and “highly
competitive”) versus “not highly competitive” (“very competitive +,”
“very competitive,” “competitive,” “less competitive,” and
“not competitive”). The Carnegie Classification categories were dichotomized
into either “research-doctoral” (includes Carnegie
Classification Research I and II and Doctoral I and II institutions)
or “not research-doctoral.”

Descriptive statistics were performed for all variables. Stepwise
linear regression (adjusted r-square method) was used to assess
which control variables (undergraduate GPA, gender, URM status,
age) contributed significantly to predicting USMLE Step 1 and
Step 2 scores and final GPA. Age was the only control variable
that did not contribute significantly to predicting any of the dependent
variables. Multiple linear regression was then performed
with each of the institutional selectivity or categorization indices,
controlling for URM status and gender. The powers of the multiple
regression equations ranged from to 96.0% for an alpha of
0.05 and with estimation of small effect sizes.
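
The dichotomization of the Barron's ratings and Carnegie categories described in this section could be sketched as below. This is only an illustration under stated assumptions: the category strings follow the wording in the text, while the function names, the lower-casing of inputs, and the error handling are hypothetical choices made here.

```python
# Barron's categories grouped as described in the Method section.
BARRONS_MOST_HIGHLY = {
    "most competitive",
    "highly competitive +",
    "highly competitive",
}
BARRONS_NOT_HIGHLY = {
    "very competitive +",
    "very competitive",
    "competitive",
    "less competitive",
    "not competitive",
}

# Carnegie categories treated as "research-doctoral" in the text.
CARNEGIE_RESEARCH_DOCTORAL = {"research i", "research ii", "doctoral i", "doctoral ii"}


def dichotomize_barrons(rating):
    """Map a Barron's Admissions Selector Rating to one of two groups."""
    key = rating.strip().lower()
    if key in BARRONS_MOST_HIGHLY:
        return "most/highly competitive"
    if key in BARRONS_NOT_HIGHLY:
        return "not highly competitive"
    raise ValueError(f"Unrecognized Barron's rating: {rating!r}")


def dichotomize_carnegie(category):
    """Map a Carnegie Classification category to one of two groups."""
    key = category.strip().lower()
    if key in CARNEGIE_RESEARCH_DOCTORAL:
        return "research-doctoral"
    return "not research-doctoral"


print(dichotomize_barrons("Highly competitive +"))
print(dichotomize_carnegie("Doctoral II"))
```

Collapsing the categories this way trades detail for subgroup size, which matches the stated motivation of avoiding insufficient subgroup sizes.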



Results 

For the 1992-1995 academic years, 545 applicants matriculated at
MUSC. Of these, 112 were admitted under MUSC’s Early Assurance
Program (therefore missing MCAT scores) and were thus excluded
from the study. Institutional selectivity index or categorization
data were incomplete for an additional 28 matriculants,
leaving complete data for 405 matriculants (73.3%).

Two hundred and sixty of the matriculants studied (64%) were
men; 70 of the matriculants (17%) were from URM groups. The
mean age was 24.0 years (SD = 4.0). The average total of MCAT
subscores was 27 (SD = 4.2) and the average undergraduate GPA
was 3.4 (SD = 0.40). Based upon the dichotomized Barron’s Admissions
Selector Rating, 235 of the matriculants (58%) had gone
to undergraduate institutions that were classified as “not highly
competitive.” Using the dichotomized Carnegie classification, 233
of the matriculants (57%) had gone to research or doctoral undergraduate
institutions. The mean USMLE Step 1 score was 205 (SD
= 21), and the mean USMLE Step 2 score was 202 (SD = 21).
The mean final medical school GPA was 3.3 (SD = 0.38).
Table 1 presents adjusted r-squared values for eight multiple regression
models computed for the three dependent variables. All
models predicted statistically significant variations in the dependent
variables. Uniformly, the worst-fitting model was that which
consisted of only the three control variables GPA, gender, and
URM status. The amounts of variation explained by this model
ranged from 17% to 32%. Addition to the model of any institutional selectivity index
or categorization slightly improved prediction (as measured by proportion
of variation explained) above the prediction provided with
GPA and demographic characteristics alone. When the MCAT
score was added to the model involving the control variables and
the GPA, it improved predictive ability of the equation by 6-13%.
The addition of the institutional selectivity indices or categorization
after the MCAT score was in the model added nothing to the
predictive ability. Control variables plus MCAT score accounted
for 38% of the variation in USMLE Step 1 scores, 38% of the
variation in final GPA, and 28% of the variation in USMLE Step
2 scores.
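
The incremental gains described above can be reproduced from the adjusted r-squared percentages in Table 1. The sketch below uses the eight rows as read from the table (in the order USMLE Step 1, USMLE Step 2, final GPA); the dictionary keys and function name are hypothetical labels introduced for this illustration.

```python
# Adjusted r-squared values (percent) as read from Table 1, in the order
# (USMLE Step 1, USMLE Step 2, final GPA).
R2 = {
    "controls": (25.13, 17.22, 32.18),
    "controls+barrons": (25.96, 19.45, 33.77),
    "controls+carnegie": (26.40, 17.83, 32.85),
    "controls+heri": (26.27, 18.14, 33.04),
    "controls+mcat": (38.23, 27.07, 37.63),
    "controls+mcat+barrons": (38.25, 27.80, 38.51),
    "controls+mcat+carnegie": (38.52, 27.15, 37.88),
    "controls+mcat+heri": (38.24, 27.08, 37.81),
}


def gain(bigger, smaller):
    """Percentage-point gain in explained variation between two models."""
    return tuple(round(b - s, 2) for b, s in zip(R2[bigger], R2[smaller]))


# Gain from adding the MCAT score to the control variables.
mcat_gain = gain("controls+mcat", "controls")
print(mcat_gain)

# Gain from adding each selectivity measure once the MCAT score is in the model.
selectivity_gains = {
    name: gain(f"controls+mcat+{name}", "controls+mcat")
    for name in ("barrons", "carnegie", "heri")
}
largest_selectivity_gain = max(max(g) for g in selectivity_gains.values())
print(selectivity_gains, largest_selectivity_gain)
```

On these values, adding the MCAT score contributes roughly 5 to 13 percentage points, whereas no selectivity measure adds as much as one percentage point once the MCAT score is already in the model.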

Discussion 

During the medical school admission process, the selectivity of the
degree-granting undergraduate institution is used to help control
for grading differences across undergraduate institutions. Our results
show that none of the three institutional selectivity indices or categorizations
(i.e., HERI, Barron’s, or Carnegie) and any GPA adjustment
that would follow will improve correlation with performances
on USMLE Step 1 and Step 2 and final GPA if MCAT
scores and unadjusted GPA are used in conjunction. While the
Barron’s and HERI indices are meant to be measures of institutional
selectivity, the Carnegie classification is a description of the degree
spectrum offered. Even evaluating schools by the type of degree
offered produced no added benefit to prediction.

Previous studies have shown that selectivity measures aid prediction
of the USMLE Step 1 score and the GPAs in medical
school years one and two if used in a model without the MCAT
scores. However, those studies used only one measure of institutional
selectivity, the HERI, or a sampling of undergraduate institutions.
Our study evaluated three different methods of classifying
the selectivity or type of undergraduate institution, and none
improved prediction in models that included the MCAT score. Furthermore,
our study examined performance on USMLE Step 2 and
final medical school GPA, performance indicators beyond the first
two years of medical school.

Our findings suggest that using institutional selectivity indices or
categorizations as an admission characteristic may not be necessary.
In addition, use of institutional selectivity indices or categorizations
may discriminate against applicants with other desirable characteristics
who have been granted degrees from less selective undergraduate
institutions. For example, use of the average SAT score might
unfairly discriminate against applicants who graduated from large,
state-sponsored universities. The lack of correlation with the Carnegie
classification also indicates that the size or academic comprehensiveness
of the degree-granting institution has little bearing
on individual performance in medical school. Our results should
reassure admission officers that the performances of students who
attend smaller undergraduate institutions or community colleges are
predictable when using their MCAT scores and undergraduate
GPAs.

One limitation of this study is that it relied upon data from only
one state-supported medical school. However, matriculants at the
school come from diverse undergraduate institutions, 116 for the
individuals in this study. Additional research should examine this
issue at other medical schools, both state-supported and private and
in various regions of the United States. Another limitation is that
because multiple linear regression was used, correlations with
USMLE scores and final GPAs cannot be adjusted for restriction



Table 1. Percentages of Variation Accounted for in Predicting USMLE
Step 1 and Step 2 Scores and Final Grade-Point Average with Three
Institutional Selectivity Measures for 1992-1995 Medical University
of South Carolina Matriculants

                                                          Percentage of Variation
                                                         USMLE 1   USMLE 2     Final
Model*                                                     Score     Score       GPA

GPA + gender + URM                                         25.13     17.22     32.18
GPA + gender + URM + Barron’s selectivity index            25.96     19.45     33.77
GPA + gender + URM + Carnegie classification               26.40     17.83     32.85
GPA + gender + URM + Higher Education Research             26.27     18.14     33.04
  Institute selectivity index
GPA + gender + URM + MCAT                                  38.23     27.07     37.63
GPA + gender + URM + MCAT + Barron’s selectivity index     38.25     27.80     38.51
GPA + gender + URM + MCAT + Carnegie classification        38.52     27.15     37.88
GPA + gender + URM + MCAT + Higher Education Research      38.24     27.08     37.81
  Institute selectivity index


* GPA = undergraduate grade-point average; gender = man or woman; URM =
underrepresented minority; Barron’s selectivity index = Barron’s Profile
of American Colleges Admission Selector ratings dichotomized into
“most/highly competitive” versus “not highly competitive”; Carnegie
classification = Carnegie Foundation for the Advancement of Teaching
Classification dichotomized into “research-doctoral” versus “not
research-doctoral”; Higher Education Research Institute selectivity
index = mean total SAT score for all students admitted in a given year
at a particular institution.



in range. Thus, the adjusted r-square values presented in Table 1
are, in all likelihood, underestimates of the relationships between
the models and the dependent variables for the applicant pool. In
addition, the dichotomization of the Barron’s Admissions Selector
Ratings and the Carnegie Classification categories may also have
had some impact on our results. However, any contributed bias
would likely have strengthened the ability of institutional
selectivity to influence the performances of students. Another
limitation is that the HERI index, although the most recent currently
available, is quite dated (1983); hence, the HERI index may not be
representative of today’s undergraduate institutions. Finally, this
study focused on primarily cognitive measures of academic achievement
in medical school. The predictive value of institutional selectivity
indices or categorization on performances in clinical settings
also should be explored.

In summary, our results indicate that the characteristics of the
degree-granting undergraduate institution, as measured by three dif- 
ferent types of institutional selectivity or categorization, do not add 
to the ability to predict performances on USMLE Steps 1 and 2 
and overall medical school GPA if the MCAT score and unadjusted 
undergraduate GPA are available. The results also further support 
the predictive validity of the scores on the MCAT examination for 
medical school performance. 
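
The comparison summarized above can be checked directly against the Table 1 percentages. The following is a minimal sketch, not the authors' analysis: the dictionary layout and the names `explained` and `gain` are illustrative assumptions, and the figures are read from Table 1 on the assumption that its eight rows of values correspond to the eight models in the order listed.

```python
# Percentages of variation accounted for (USMLE Step 1, USMLE Step 2, final GPA),
# as read from Table 1. "base" is GPA + gender + URM. Row-to-model mapping is assumed.
explained = {
    "base": (25.13, 17.22, 32.18),
    "base+Barron's": (25.96, 19.45, 33.77),
    "base+Carnegie": (26.40, 17.83, 32.85),
    "base+HERI": (26.27, 18.14, 33.04),
    "base+MCAT": (38.23, 27.07, 37.63),
    "base+MCAT+Barron's": (38.25, 27.80, 38.51),
    "base+MCAT+Carnegie": (38.52, 27.15, 37.88),
    "base+MCAT+HERI": (38.24, 27.08, 37.81),
}


def gain(larger, smaller):
    """Percentage-point gain in explained variation for each outcome."""
    return tuple(
        round(a - b, 2) for a, b in zip(explained[larger], explained[smaller])
    )


# Gain from adding the MCAT score to the base model.
mcat_gain = gain("base+MCAT", "base")

# Gain from adding each selectivity measure once the MCAT score is already included.
selectivity_gains = {
    measure: gain(f"base+MCAT+{measure}", "base+MCAT")
    for measure in ("Barron's", "Carnegie", "HERI")
}
largest_selectivity_gain = max(max(g) for g in selectivity_gains.values())

print("MCAT gain:", mcat_gain)
print("Selectivity gains with MCAT present:", selectivity_gains)
print("Largest selectivity gain:", largest_selectivity_gain)
```

Read this way, adding the MCAT score raises the explained variation by several percentage points for every outcome, while no selectivity measure adds as much as one percentage point once the MCAT score is in the model.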

Correspondence: Amy V. Blue, PhD, College of Medicine, Dean’s Office, Medical
University of South Carolina, 96 Jonathan Lucas Street, Suite 601, Charleston, SC
29425; e-mail: (blueav@musc.edu).
SOMETHING OLD, SOMETHING NEW



Moderator: John Littlefield, PhD



The Presence of Hospitalists in Medical Education

JUDY A. SHEA, YASMINE S. WASFI, KIMBERLY J. KOVATH, DAVID A. ASCH, and LISA M. BELLINI



Over the past few years, the care of medical inpatients increasingly 
has been turned over to an emerging group of professionals called 
hospitalists. Typically, hospitalists are internists who devote the
majority of their clinical effort to caring for inpatients. Outpatient
responsibilities, if they exist, are minimal. The defining
characteristic of hospitalists is the “hand-off” cycle: a primary care provider
admits the patient to the designated hospitalist, who provides in- 
patient care and then sends the patient back to the primary care 
provider upon discharge. 

Many early arguments for hospitalists centered on the positive 
impact that this model of care would have on resource utilization 
and patient outcomes. Indeed, early data suggest that care by hos- 
pitalists is associated with reductions in length of stay, lower read- 
mission rates, and improved resource utilization, and there seems 
to be little negative impact on patients’ satisfaction. Among the
issues that have not been fully addressed is the role that hospitalists
play in medical education. Potential issues have been discussed,
such as a diminished sense of autonomy among residents, perhaps
counterbalanced by increased satisfaction and better supervision
of patients. For other issues, such as the presence of
hospitalists in academic medical centers and their teaching re- 
sponsibilities, few data have been presented. Historically, the cor- 
nerstone of both undergraduate and graduate medical education has 
been inpatient-based. Though ambulatory care training has been 
emphasized in recent years, the inpatient wards remain the major 
site of clinical teaching. If beds and/or wards are being turned over 
to hospitalists, it is important to determine the impact this may 
have on educational programs. 

The purpose of this study was to address such educational issues. 
Separate questionnaires were mailed to the chairs and program di- 
rectors of all internal medicine training programs in the United 
States to learn (1) how many programs have hospitalists on staff,
and to gain information about related census issues (e.g., number
hired, plans for future hires); (2) the role of hospitalists in teaching
activities; and (3) attitudes regarding the roles hospitalists play in
general and their role in teaching, specifically.

Method 

The questionnaire was sent to all chairs and program directors of
accredited internal medicine training programs who were identified
in the spring of 1999 using the 1998-1999 AMA Graduate Medical
Education Directory. This process resulted in a roster of 106 chairs,
382 program directors, and 22 individuals who filled both roles.
Three separate questionnaires were developed. Content was defined
by the study team, taking ideas from current literature as well as
discussions that had taken place locally in the course of developing
a hospitalist service in 1998. Draft instruments were revised
numerous times to improve clarity and breadth, after piloting them
with faculty.

The questionnaire for chairs was brief (eight questions) and focused
on asking whether hospitalists were employed at the sites and,
if so, defining how long they had been there and their training
backgrounds and responsibilities. A more extensive questionnaire
was developed for program directors. In addition to general program
descriptions, directors were asked whether hospitalists were employed
and were provided 12 attitude statements about hospitalists to be
answered on a five-point Likert scale from “strongly disagree” to
“strongly agree.” For programs that had hospitalists on staff, a set
of questions focused on teaching responsibilities, participation in
other educational activities, and 13 more attitude statements about
hospitalists’ roles and their impact upon teaching. The questionnaire
for the few individuals who were both chairs and program
directors was a collection of the unique items from the other two
versions. The first mailings were sent in April 1999. A second mailing
with a new copy of the instrument was sent in June 1999.
Because the response rate for program directors was low, for the
third mailing, items asking about activities at each training site
were omitted to reduce the respondents’ burden, thus shortening
the questionnaire from four to two pages. The third mailing was
sent in August 1999. The final response rates were 78.3% (n = 83)
for chairs and 57.6% (n = 220) for program directors. The eight
responses from the 22 chairs-program directors were added to both
data files, for analytic sample sizes of 91 and 228.
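
The response rates and analytic sample sizes reported above follow from simple arithmetic. The sketch below assumes that the denominators are the roster counts of 106 chairs and 382 program directors given in the Method section; the function and variable names are illustrative only.

```python
def response_rate(responded, mailed):
    """Response rate as a percentage, rounded to one decimal place."""
    return round(100 * responded / mailed, 1)


# Roster sizes and respondent counts as reported in the text.
chairs_rate = response_rate(83, 106)
directors_rate = response_rate(220, 382)

# The eight responses from people holding both roles were added to both files.
dual_role_responses = 8
chairs_sample = 83 + dual_role_responses
directors_sample = 220 + dual_role_responses

print(chairs_rate, directors_rate, chairs_sample, directors_sample)
```

Under that assumption the computed rates and sample sizes agree with the figures in the text.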

Analyses of the responses focus on description. We used standard
univariate statistics (frequencies and percentages) to characterize
the sample. To test for differences between programs that did and
did not respond, between responses to the long and short survey
forms, and between programs that did and did not employ
hospitalists, we used chi-square, t-tests, and the Wilcoxon two-sample
test.
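
For readers unfamiliar with the Wilcoxon two-sample (rank-sum) test, the sketch below shows one common way to compute it for five-point Likert responses, using midranks for ties and a tie-corrected normal approximation. It is an illustration only: the function name `rank_sum_test` and the example responses are hypothetical, and the text does not say which software or variant of the test the authors used.

```python
from collections import Counter
from statistics import NormalDist


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, tie-corrected).

    Returns (W, z, p), where W is the sum of the ranks of sample x.
    Not defined when every pooled value is identical (zero variance).
    """
    pooled = sorted(list(x) + list(y))
    n1, n2 = len(x), len(y)
    n = n1 + n2

    # Midranks: tied values share the average of the ranks they occupy.
    ranks = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    w = sum(ranks[v] for v in x)
    mean_w = n1 * (n + 1) / 2
    tie_term = sum(t**3 - t for t in Counter(pooled).values())
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (w - mean_w) / var_w**0.5
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return w, z, p


# Hypothetical Likert responses (1 = strongly disagree ... 5 = strongly agree).
with_hospitalists = [4, 4, 5, 3, 4, 5, 4, 3]
without_hospitalists = [3, 3, 4, 2, 3, 4, 3, 3]
print(rank_sum_test(with_hospitalists, without_hospitalists))
```

A rank-based test suits ordinal Likert data because it compares the ordering of responses rather than assuming equal spacing between scale points.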

Results 

Respondents and Non-respondents. Program characteristics available
from the 1998-1999 AMA Graduate Medical Education Directory
allowed limited comparison of non-respondents with respondents.
Overall, the program sizes were the same for respondents
(mean = 52.9, SD = 33.2) and non-respondents (mean = 55.8, SD
= 36.9, p = .40). The respondents and the non-respondents did not
come from different regions of the country (p = .086).

There were few significant differences between the responses of 
the 130 program directors who responded to the long form of the 
program directors’ questionnaire and the 90 who responded to the 
short one. For example, there was no difference in numbers of in- 
patient training sites (p = .63) or numbers of categorical residents 
at the PGYl level (p = .35). There was no difference in the per- 
centages who had hospitalists (p = .80), were planning to hire 
hospitalists (p = .59), or had rejected the idea of having hospitalists
(p = .47). Those responding to the short form had more favorable
attitudes with respect to one of the 13 attitude items.

Chairs. Overall, 50 of the chairs (55.6%) reported that hospital- 
ists were employed at one or more of their training sites. The num- 
bers of hospitalists per institution ranged from one to 15, with a 
median of four. (The total number of hospitalists employed by the
44 programs that reported having them was 206.5.) Twenty-nine 
(64.4%) planned to hire more hospitalists. The tenure of hospitalist 
programs was a median of two years, with a range of 0.5 to 7.5 
years. Nearly three fourths of the hospitalists (71.9%) had completed
residencies in internal medicine, 4.4% had completed general
internal medicine fellowships, and 11.4% had completed
subspecialty fellowships.

The reported duties of the hospitalists were quite variable. The
numbers of months of inpatient responsibilities ranged from one to
12, with a median of eight. The percentages of the responding
department chairs reporting other responsibilities for hospitalists
Table 1. Attitudes of IM Program Directors Regarding Hospitalists, and Comparison of Attitudes of Those Whose Programs Do
and Do Not Employ Hospitalists, 1999

                                                   All Program Directors*         Comparison of Two Groups
                                                         (N = 217)                  of Program Directors†
                                                                               Those with Those without
                                                        %        %        %  Hospitalists  Hospitalists
Questionnaire Statement                          Disagree  Neutral    Agree     (n = 109)     (n = 108)        p

Hospitalists are more familiar with practical        23.9     17.7     58.4          3.55          3.25      .04
  aspects of inpatient care than other
  physicians attending on inpatient services.
Hospitalists need additional training beyond         53.1     27.1     19.8          2.57          2.77      .39
  standard internal medicine residency.
Hospitalists provide better inpatient care           33.3     33.3     33.3          3.03          2.94      .61
  than other general internists attending on
  inpatient services.
Hospitalists provide better inpatient care           33.0     27.3     39.7          3.13          2.90      .13
  than subspecialists attending on inpatient
  services.
Patients of hospitalists are satisfied with           3.5     42.6     54.0          3.81          3.24    .0001
  the inpatient care they receive.
Patients of hospitalists are satisfied with           4.5     61.6     33.8          3.35          3.23      .18
  the outpatient care they receive.
The use of hospitalists disrupts continuity          25.8     20.1     54.1          3.21          3.51      .17
  of patient care.
The use of hospitalists improves patient care.       11.1     47.1     41.8          3.49          3.18     .005
The use of hospitalists is good for hospitals         2.4     40.1     57.5          3.83          3.52     .008
  financially.
The use of hospitalists improves the training        17.7     45.5     36.8          3.45          2.99    .0003
  of residents.
The use of hospitalists improves the training        18.8     46.6     34.6          3.38          3.01     .004
  of medical students.
I expect that use of hospitalists will increase       3.8     11.0     85.2          4.16          3.86    .0025
  over the next few years.

* Responses were made on a five-point scale: 1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree, and 5 = strongly agree. In reporting the overall responses, the 1 and 2
categories were collapsed, as were the 4 and 5 categories.

† Group comparisons were made between program directors from programs with and without hospitalists. Scores on the five-point scale were compared with the Wilcoxon two-
sample test.



were: 55.3% reporting hospitalists with outpatient practices (with
a median of 10% full-time equivalent); 77.8%, medicine consultation;
46.8%, clinical pathways/disease management development
or implementation; 31.9%, quality assurance; 27.7%, medical
directorships; and 17.4%, insurance company or managed care liaisons.
Of note, 53.2% required academic productivity for promotion.

Of the programs that did not have hospitalists, 37.1% planned 
to hire them in the future and 16.1% had considered but rejected 
the idea. 

Program Directors. Overall, 50.5% of the responding programs
employed hospitalists. As shown in Table 1, many program directors’
attitudes about hospitalists were positive. For example, the
majority agreed that hospitalists are more familiar with practical
aspects of inpatient care, that they are good for the hospital
financially, and that patients of hospitalists are satisfied with their
inpatient care. Most disagreed that they needed more training beyond that
gained in an internal medicine residency. On the other hand, most
also thought that use of hospitalists disrupted the continuity of
patient care, and only one third agreed that hospitalists provide
better inpatient care than other general internists.

The last three columns of Table 1 show the means of the Likert-scale
responses of the program directors with and without hospitalists,
along with the p values for the comparisons. Differences were
significant, and in the anticipated direction, for seven of the 12
attitude statements.
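
The count of significant differences can be verified from the p values in Table 1. The sketch below assumes a conventional significance threshold of .05, which the text does not state explicitly; the shortened statement labels are paraphrases introduced here for illustration.

```python
# p values from the last column of Table 1 (labels are shortened paraphrases).
p_values = {
    "more familiar with inpatient care": 0.04,
    "need additional training": 0.39,
    "better care than general internists": 0.61,
    "better care than subspecialists": 0.13,
    "patients satisfied with inpatient care": 0.0001,
    "patients satisfied with outpatient care": 0.18,
    "disrupts continuity of care": 0.17,
    "improves patient care": 0.005,
    "good for hospitals financially": 0.008,
    "improves training of residents": 0.0003,
    "improves training of medical students": 0.004,
    "use will increase": 0.0025,
}

ALPHA = 0.05  # assumed threshold; not stated in the text

significant = [label for label, p in p_values.items() if p < ALPHA]
print(len(significant), "of", len(p_values), "statements differ significantly")
```

With that threshold, seven of the 12 statements fall below .05, matching the count given in the text.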

In addition, respondents of the 109 programs with hospitalists
were asked whether the hospitalists participated in a number of
different activities related to education. Nearly all participated in
the teaching of medical students (80.2%) and residents (84.5%).
Other educational activities in which they participated included
attending physicians’ rounds (74.7%); residents’ reports (58.6%);
management conferences (53.5%); curriculum development
(55.6%); and journal club (48.5%). Specific topics taught by the
hospitalists included cost-effective care (57.1%); resource
utilization (57.1%); health economics (42.9%); clinical pathways/disease
management (38.8%); and insurance principles (26.5%). In nearly
all programs (78.1%), students and housestaff evaluated the
hospitalists.



Table 2 lists additional attitudes of the program directors who
employed hospitalists, especially their perceptions of hospitalists’
role in and impact on teaching activities. Over 70% agreed
that hospitalists are viewed as good educators and are respected.
The majority thought that hospitalists have led to improved
housestaff supervision and are more accessible to housestaff than
other teaching faculty. They were less certain that hospitalists had
an impact on housestaff’s considerations of lengths of stay and costs
of tests and procedures, or that the housestaff had learned to order
fewer tests and procedures.



The results of the surveys reported above show that hospitalists
have a presence in both undergraduate and graduate medical
education: at least half of the responding training programs employed
hospitalists, who in most cases played roles in teaching students
and/or residents. Attitudes expressed by the total sample of program
directors were generally positive, naturally more so for those
representing programs with hospitalists. In particular, program
directors from programs with hospitalists were especially complimentary
about the hospitalists’ familiarity with practical aspects of care,
their positive financial impact on the hospitals, their positive
impact on patients’ satisfaction, and the improvements in residency
training. On the other hand, most programs had only a few
hospitalists, they had had them for only one or two years, the numbers
of months in inpatient responsibility ranged from one to 12, and
their involvement in a variety of specific teaching activities was
varied. Given this variation, it might not be feasible to characterize
“the” teaching role of hospitalists.

This study has some limitations. The response rate for program
directors was relatively low, although we found no evidence of bias.
Second, we are able to create a composite picture of what
hospitalists do, but we did not collect parallel data regarding
non-hospitalist attending physicians. Thus, we are missing a piece of the
total picture. Third, we did not ask for detailed data regarding the
teaching activities of the hospitalists, e.g., what does participation
Table 2. Attitudes Held by 109 Directors of IM Programs That Employ Hospitalists

                                                                    %         %         %
Questionnaire Statement                                     Disagree*   Neutral    Agree*

The hospitalists at my institution are viewed as good             3.2      18.3      78.5
  educators.
Housestaff supervision has improved with the addition of         17.8      30.0      52.2
  hospitalists.
Housestaff have adequate exposure to physician-scientist         21.1      25.0      57.6
  faculty.
Hospitalists are more accessible to housestaff than other        21.1      16.7      62.2
  teaching faculty.
Housestaff are more comfortable managing inpatients with         27.8      31.1      41.1
  hospitalists than with other general medical attendings.
Housestaff are more comfortable managing inpatients with         35.6      36.7      27.8
  hospitalists than with subspecialist attendings.
Hospitalists are respected at my institution.                     6.4      21.3      72.3
Housestaff who have worked with hospitalists consider            20.0      42.2      37.8
  length of stay in their management plans more than they
  did previously.
Housestaff who have worked with hospitalists consider            18.7      44.0      37.4
  cost of tests and procedures when determining their
  management plans more than they did previously.
Housestaff-perceived autonomy has been reduced by the use        47.3      29.7      23.1
  of hospitalists.
The use of hospitalists has reduced the inpatient teaching       31.9      12.1      56.0
  responsibilities of the other faculty physicians.
The use of hospitalists as teaching attendings has               44.9      25.8      29.2
  resulted in less interaction between housestaff and
  primary care providers.
Housestaff who have worked with hospitalists have learned        22.5      52.8      24.7
  to order fewer tests and procedures than they did
  previously.

* Responses were made on a five-point scale: 1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree, and 5 = strongly agree. The 1 and 2 categories were collapsed, as were
the 4 and 5 categories.



in curriculum development mean? Nevertheless, to our knowledge
this is the first study to detail the presence of hospitalists and
provide an overview of their teaching activities in teaching
institutions.

Overall, even though their numbers were small, in at least half 
of the U.S. internal medicine training programs that responded,
hospitalists were present and played roles in teaching. Given the 
amount of time they spend in inpatient services, they have wide- 
spread exposure to learners on all levels. This visibility makes hos- 
pitalists as a group an ideal target for faculty development focused 
on teaching methods and feedback skills. Although the respondents 
generally viewed hospitalists as excellent teachers who had led to
improved training for residents, hospitalists as teaching faculty 
should be evaluated compared with faculty involved in teaching 
on traditional services. 

Much of the justification for hospitalists draws on arguments that
they should be able to save money by reducing test ordering and
lengths of stay. If they are really succeeding in these areas, as
current data suggest they might be, it is logical to assume they could
affect residents’ behaviors via modeling and/or direct teaching of
optimal management strategies. The fact that so few program
directors believe hospitalists will have an effect on residents’ future
behaviors in the areas of ordering and effective management is
somewhat surprising and deserves more study. Examining residents’
behaviors and attitudes in terms of the relative amounts of exposure
they have to services led by hospitalists would yield useful insights.
An additional contribution would be an understanding of how, as
a group, hospitalists’ teaching activities and outcomes differ from
those of their peers, and whether there is observable variation in
the reasons hospitalists’ services were developed (e.g., cost
efficiency, excellence in inpatient teaching, as a “safety valve” for
overburdened teaching services). Future studies will be most helpful
if they are aimed at defining the unique contributions attributable
to hospitalists.

Correspondence: Judy A. Shea, PhD, University of Pennsylvania, 1232 Blockley
Hall, 423 Guardian Drive, Philadelphia, PA 19104-6021; e-mail:
(sheaja@mail.med.upenn.edu).
Dual-degree MD-MBA Students: A Look at the Future of Medical Leadership

WINDSOR WESTBROOK SHERRILL 



In an increasingly turbulent medical care system, business training 
is one way doctors and medical students are seeking to redefine 
their ability to lead and wield influence. Changes in the health 
care system have fostered the need for physician executives with 
business training who can serve as liaisons between administrative
and clinical personnel. As the development of integrated delivery 
systems has combined clinical and administrative functions, the 
roles of physician executives have increased, as well as the demand 
for related training of physician leaders. Growth in the number of 
physician executives is expected to continue as such individuals 
demonstrate their ability to facilitate provider-physician relations 
and lend unique expertise and perspective in the health care delivery
system.

The transition from clinical roles to administrative functions can 
be challenging for physician executives. Moving into administrative
roles presents challenges different from those inherent in medical
training and practice. If the physician manager is to be considered
an effective asset to an organization, the new role requires
distinct shifts in thinking, philosophy, attitudes, and behavior.
Because traditional clinical training of physicians contrasts with
management training and functions, few physicians are prepared for the
requirements of management roles.

Several studies of leadership and management have found that 
leaders’ personality and behavioral characteristics are reliably
predictive of group performance. Leadership success is associated
with interpersonal ability, group-oriented behaviors, empathy,
boldness in times of uncertainty, internal locus of control, and
confidence. Leadership theory suggests that effective leaders are able
to identify and actively respond to changes in a profession,
organization, or situation.

Although a growing number of practicing physicians have ob- 
tained business (MBA) degrees, relatively few educational initia- 
tives have been focused on business and management training 
within the medical school program. In response to this demand,
a limited number of medical schools are offering dual-degree pro- 
grams in medicine and business. Established through cooperative 
agreements between medical and business schools, these programs 
offer a variety of arrangements through which medical students can
obtain business and clinical training concurrently.

Students enrolled in dual-degree programs make up an important 
group for exploratory research. If dual-degree medical students ex- 
hibit characteristics associated with successful leaders, this might 
indicate their ability to function as effective leaders in both clinical 
and management roles. Within the traditional medical school en- 
vironment, it is possible that this group is reshaping individual 
beliefs about physician roles and the fit between clinical and ad- 
ministrative functions. Their career goals and the factors influenc- 
ing these students to seek business training might provide an in- 
dicator of the leadership styles and roles of future physician 
executives. 

Method 

According to Peterson’s Guide to Graduate Programs in Business,
Education, Health and Law, there were eight medical schools that
offered dual-degree MD-MBA programs in 1997. Of the eight
schools, six had coordinated MD-MBA programs for which program
directors were designated and students followed a defined path
in course work. Students in these programs were selected for
inclusion in the present study.

Of the six dual-degree programs, one MD-MBA program could 
be completed in four years by using summers for course work. The 
other programs that were examined required five or six years of 
study. Each of the dual-degree programs had some component of 
an administrative internship for the students. 

Survey and interview measures were used to analyze students at
the six medical schools offering MD-MBA programs (n = 87):
Bowman Gray School of Medicine, Jefferson Medical College, the
University of Chicago, the University of Pennsylvania, the
University of Illinois at Urbana-Champaign, and Tufts Medical
School. The 87 students who were enrolled in dual-degree programs
were surveyed; a control group of traditional medical students was
also surveyed (n = 115). Traditional medical students at each site
were selected based on a set of characteristics matched with those
of the dual-degree students. The data were also compared with the
findings from a national survey of graduating medical students
compiled by the Association of American Medical Colleges. Forty of
the 87 dual-degree medical students surveyed were also interviewed.
The interviews were analyzed using Ethnograph, and survey data
were analyzed using SPSS.

To assess whether they might overcome the barriers between 
clinical and management roles, dual-degree medical students were 
compared with traditional medical students on dimensions that 
were selected for their potential to indicate leadership ability. Dual- 
degree students were also asked about their career goals and the 
factors that had influenced them to seek business training. 

Results 

The response rate for the survey of the 87 MD-MBA students was
85%; the response rate for the 115 medical students in the control
group was 69.6%. A major finding of the study is that there are
indeed significant differences between dual-degree and traditional
medical students on a number of dimensions that relate to career
plans, leadership, motivation to be leaders, and confidence.
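
The approximate numbers of respondents in each group can be recovered from the reported response rates. This is a rough sketch under the assumption that the rates were computed against the 87 and 115 students surveyed; because the 85% figure is rounded, the dual-degree count is an estimate. The names used are illustrative.

```python
def respondents(rate_percent, surveyed):
    """Estimate the number of respondents from a reported response rate."""
    return round(rate_percent / 100 * surveyed)


dual_degree_n = respondents(85, 87)    # MD-MBA students
control_n = respondents(69.6, 115)     # traditional medical students

# Upper bound on degrees of freedom for a two-group t-test with pooled
# variance, reached only if every respondent answered the item.
max_df = dual_degree_n + control_n - 2

print(dual_degree_n, control_n, max_df)
```

An upper bound of this size is in line with the two-group t-tests reported in this section, whose degrees of freedom are slightly smaller, as would be expected if a few respondents skipped individual items.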

One set of questions was intended to assess students’ beliefs,
concerns, and perceptions about the future of medical practice. The
set of questions was designed to compare the attitudes of
dual-degree and traditional medical students regarding the changes in
health care and the evolution of the physician role. The students
were asked to rank statements such as “job opportunities for
physicians are increasingly limited” and “the health care financing
system is too burdensome on physicians.” Answers to these questions
provided a composite index of students’ perceptions, including
attitudes concerning the role of physicians in society. Both survey
responses and interview feedback support the hypothesis that
dual-degree students are very conscious of the changing nature of the
medical care system and the need to transform physician roles.
Dual-degree students were less likely to feel negative about changes
in job opportunities for physicians or about regulatory or financial
constraints in medicine. The data also support the hypothesis that
dual-degree students are influenced to obtain business degrees
because of concern about the changing job market for physicians.

The members of both the dual-degree and control groups were
asked what they expected to earn five and ten years after completing
residencies. The MD-MBA group had an expected mean income
after five years of $167,986, while the MD students had a
mean of $132,208. The means of the two groups were significantly
different: t (147) = 3.66, p < .0001.

As an indicator of their career plans and aspirations, dual-degree
students were asked to rank activities according to how they would
feel about them as primary job responsibilities. Job responsibilities
ranged from CEO of a for-profit hospital to medical director of an
inner-city health clinic. Job responsibilities were provided as
indicators of the types of positions these students might desire,
particularly related to their tendencies toward more altruistic positions
compared with activities that might be traditionally associated with
the “business” of medicine. The job activities were organized into
subgroups based on activity type and were developed to reflect items
that might indicate students’ altruistic versus economic
philosophies. The first group included medical director of an HMO, CEO
of a biotechnology company, medical director of an insurance
company, and chief of staff of a for-profit hospital system. In contrast
to the first group of activities, the second group included activities
traditionally associated with the public-services arena. This group
included medical director of an inner-city health clinic, chief of
staff of a rural hospital, medical liaison for the World Health
Organization, and deputy director of the state board of health.

The combined group ratings were compared using t-tests, and
the subgroup scores were significantly different. The mean for the
“business” subgroup was 1.83; the “public services” subgroup mean
was 2.26. The dual-degree students considered the business group
significantly more appealing, t (105) = 3.02, p < .05.

Both dual-degree and traditional medical students were asked to 
select their preferences from a list of career activities, including 
such things as full-time faculty appointment, private clinical
practice, and administrative duties. Seventy-eight percent of the
dual-degree students expressed an interest in a combination of clinical
and administrative duties; 13.5% of the dual-degree students
planned administrative jobs with no clinical practice. Several
dual-degree students interviewed stated that they planned to forgo
residency training to initiate careers in the private sector.

Both traditional medical and dual-degree students were asked
whether they were confident that they would have the necessary
clinical and administrative skills when they graduated from their
respective educational programs. These results were compared with
corresponding information from the national database of graduating
medical students as well as from the control group of students.
Dual-degree students expressed little doubt in their clinical or
administrative abilities and were significantly more confident than
were their medical student counterparts (clinical skills: t(151) =
6.409, p < .0001; administrative skills: t(150) = 2.913, p < .01).

Confidence in one’s ability to influence others and the environment
is associated with leadership. Yet, misplaced confidence can
lead to poor decision making for both clinicians and managers. It
is interesting that the dual-degree students were more confident
than were the traditional medical students with respect to both
clinical and administrative skills. Although a positive self-concept
may be beneficial, the students’ confidence has implications for the
future of medical care. The potential overconfidence of the MD-MBA
students needs to be understood and managed to avoid potentially
disastrous effects; confidence is a positive attribute for leaders
and managers, but overconfidence can be a barrier to effective
decision making. Individuals who are overconfident might fail to
seek consensus among groups and lack the discipline to seek out
information in solving clinical and management problems.

Students’ influences and motives for choosing the dual-degree
programs, as well as their career plans, provide an indication of the
roles they will play in the delivery system. It was hypothesized that
dual-degree students are motivated to seek business degrees because
of a desire to be leaders in the health care delivery system. Their
career goals and plans illustrate such motivation. In response to
survey questions related to their reasons for seeking business
education, the students rated most highly factors such as career
opportunities, opportunity for innovation, opportunity to be a leader
in medicine, and opportunity to make a difference in medicine.

Discussion and Implications 

A new model of physician executives is emerging from dual-degree 
programs. Young physicians are making decisions not only at the 
beginning of their medical careers, but in most cases, for these 
students, at the beginning of their medical education. It is possible 
that dual-degree programs will produce individuals equipped to take 
leadership roles in managerial assignments early in their careers, 
perhaps even in residency programs. 

This study underscores an important policy question for the
health care system and medical education. The challenges facing
the health care system are both economic and equity-related.
Escalating costs are combined with serious problems of underserved
populations. Some of the most significant management challenges
in the delivery system concern how to distribute health care
services equitably and how to improve access to basic health care
services. Physicians with business training are needed in all areas,
not only in the areas of high technology and high costs. The study
findings suggest that dual-degree programs are attracting students
primarily with business interests. Eleven of the 40 students
interviewed had full-time and significant work experience in areas
such as investment banking and health care consulting prior to
matriculation in medical school. Students interested in working with
public health needs and underserved populations are not well
represented in the dual-degree programs.

Are the programs too narrowly focused on dealing with the
business of health care delivery? As early adopters of an innovative
medical education initiative, dual-degree students provide a unique
perspective on the direction of medical leadership and alternatives
to traditional medical careers. A key finding of the study suggests
that this cohort wants to direct hospital and insurance companies
more than they want to work in the public sector. This indicates
that the motivation for those students to seek dual-degree programs,
as well as motivating factors behind program development, were
related to business and high-technology settings. The students’
job-activity preferences and income expectations provide support for
this conclusion. Traineeship experiences and mentors provided by
the dual-degree programs may need to be modified to address these
trends.

Physician executives are likely to have a pivotal role in the
uncertain future of health care. The management of health care
resources requires a combination of skills that balance the
principles of economics, finance, and accounting with patient and
population health needs. Dual-degree medical education programs can
help develop physician leaders who can blend clinical and
management skills into an effective vision for the future of health
care delivery.

The authors of In Search of Physician Leadership observe that
physicians are entering management in increasing numbers and at
increasing levels of responsibility, a trend they assert portends well
for the medical profession and the health care system. Medical
education programs combining business and clinical education are
training students who can contribute to this positive trend. As one
student stated, “[Dual degrees] can bring values of medicine into
the business world. It used to be totally different, but now things
are beginning to merge. We can do the best for both fields.”

This is an exploratory study of an innovation in medical
education. The early stages of this field offer the opportunity to step
back and consider the professional identity desired among
dual-degree medical students. Dual-degree programs are producing a
prototype of physician executive whose training is remarkably different
from that of traditional physicians. The data suggest that there is
an interesting range of expectations among dual-degree medical
students and the careers that they anticipate. The interests and career
preferences of the students reveal several trends of concern, but
also suggest that these programs can make an important
contribution to the health care system.

Correspondence: Windsor Westbrook Sherrill, MHA, PhD, Assistant Professor,
Department of Public Health Sciences, 525 Edwards Hall, Clemson University, Clemson,
SC 29632; e-mail: (Ws/imiQderjuon.edn).

A Preliminary Analysis of Different Approaches to Preparing for the USMLE Step 1

RAJ A. THADANI, D. B. SWANSON, and ROBERT M. GALBRAITH



Prior to taking the United States Medical Licensing Examination
(USMLE) Step 1, medical students commonly spend large amounts
of time studying on their own or in groups. They also may
participate in test-preparation activities offered by their schools. In
addition, there is anecdotal evidence indicating that medical students
increasingly purchase “board prep” publications and sign up for
commercial coaching courses that may last several weeks and cost
thousands of dollars.

The effectiveness of alternate approaches to preparing for Step
1 is unknown, though research on other high-stakes exams suggests
that exam performance may be improved by coaching courses.

In this article, we report the results of a recent survey of the
strategies used by medical students to prepare for Step 1, and present
our preliminary analyses of the relationships between preparation
strategies and test scores.

Method 

Participants in this study were chosen by taking a random sample 
of first-time Step 1 takers from U.S. and Canadian allopathic med- 
ical schools. In a survey following the June 1998 paper-and-pencil 
administration of Step 1, participants were asked about their study
habits in relation to: the number of hours spent studying for Step 
1 each week; the types of materials they had used when studying; 
the number of weeks of full-time study; and if applicable, a series 
of questions about coaching course(s). Survey responses from each 
participant were matched to his or her USMLE Step 1 scores, Med- 
ical College Admission Test (MCAT) scores, and undergraduate 
science grade-point average (GPA). 

The survey questionnaire was mailed to a random sample of
3,958 first-time takers of Step 1. A total of 1,650 responded, but
because MCAT scores and GPAs were not available for all
examinees, only 1,217 were included in the analysis reported here.
Initially, information about all the test-preparation materials and
courses (methods) was examined descriptively. Then, to evaluate
the usefulness of the various preparation methods, ordinary
least-squares (OLS) regression analysis was employed. The variables
included in the equation fell into four sets.

■ Pre-matriculation characteristics. MCAT total score (sum of
biological sciences, physical sciences, and verbal reasoning scores)
and adjusted science GPA. Scores were calculated using the
following formula: adjusted science GPA = undergraduate science GPA
× selectivity index / 1,000, where the selectivity index is equal
to the mean Scholastic Aptitude Test score for students at the
undergraduate school attended. This adjustment controls, to a
degree, for the variation in grading stringency across
undergraduate institutions.

■ Study time. Weeks of full-time study and hours studied per week.

■ Preparation methods. Use of USMLE materials, lecture notes,
course syllabi, note-taking service, textbooks, commercial study
guides, school materials, group study.

■ Coaching course. A 0/1 dummy code that reflects participation in
a coaching course plus an interaction term equal to the product
of the dummy code and the MCAT total score.
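
As an illustration only, the two derived predictors described in the list above can be sketched in Python. The sketch assumes that the operator lost in the printed formula between the selectivity index and 1,000 is a division; the function names and the example examinee are hypothetical and do not come from the study.

```python
from typing import Tuple


def adjusted_science_gpa(science_gpa: float, selectivity_index: float) -> float:
    """Scale the undergraduate science GPA by selectivity index / 1,000.

    Assumption: the selectivity index (mean SAT score at the undergraduate
    school) is divided by 1,000, as the surrounding text implies.
    """
    return science_gpa * selectivity_index / 1000.0


def coaching_terms(took_course: bool, mcat_total: float) -> Tuple[int, float]:
    """Return the 0/1 dummy code and its product with the MCAT total score."""
    dummy = 1 if took_course else 0
    return dummy, dummy * mcat_total


# Hypothetical examinee: science GPA 3.40, school mean SAT 1150, MCAT total 30.
print(round(adjusted_science_gpa(3.40, 1150), 2))
print(coaching_terms(True, 30))
print(coaching_terms(False, 30))
```

Under this reading, a student from a more selective school receives a higher adjusted GPA for the same raw GPA, and the interaction term is zero for every examinee who did not take a coaching course.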

Results 

Descriptive Information about Preparation Methods. Analysis of the
responses to the survey showed that 98% of the respondents had
used commercial guides. In contrast, only 70% had used the official
USMLE General Instructions, Content Description and Sample Items
Booklet, which was provided to all examinees with application
materials. Other methods had been used less often: lecture notes
(39%), note-taking services (6%), textbooks (44%), course syllabi
(21%), preparation materials provided by the school (25%), and
group study (25%).

The survey responses also indicated that 23% of the respondents 
had enrolled in a commercial coaching course. When asked about 
the emphasis of the coaching course, 57% of these respondents 
reported that it had focused on learning Step 1 content, but also 
included instruction on test-taking strategies. Twenty-eight percent 
indicated that the emphasis of their courses had been entirely on 
learning Step 1 content. Less than 5% reported that their courses 
had spent a majority of the time on test-taking strategies. Exam- 
inees were asked to subjectively rate the value of their courses on 
a scale of 1 to 5, with 1 being not helpful and 5 being very helpful.
The examinees’ average rating was 3.2. 

Analysis of mean scores attained by users versus non-users of 
certain study methods showed significantly better performances 
among examinees who had used the USMLE general instructions 
booklet, textbooks, course syllabi, and study materials provided by 
the medical school. Examinees who had enrolled in coaching 
courses received on average lower scores than those who did not 
enroll (Table 1). However, it should be noted that the examinees 
who had enrolled in coaching courses had significantly lower 
MCAT scores (28.8 versus 30.2) and adjusted science GPAs (3.70 
versus 3.88). Differences in mean scores between users and
non-users of other test preparation methods were smaller and not
statistically significant.

Descriptive Information about Study Time. In order to examine 
the effects of preparation time, the examinees were asked how 
many weeks they had spent studying for the exam full time and 
how many hours per week they studied during that time. Since 
these examinees were first-time takers from U.S. medical schools, 
the vast majority had recently completed their second year of med- 
ical school; thus, the number of weeks of full-time study was de- 
pendent to some degree on when they had completed their course- 
work. Analysis indicated that the examinees had studied for an 
average of 5.8 weeks and for 53 hours per week. Scores were
positively correlated with weeks of full-time study. However, Step 1
performance showed little relationship to the number of hours of
study per week.

Regression Analysis of Factors Influencing Step 1 Scores. A series
of regression analyses were run to examine the relationships
between Step 1 performance and exam-preparation strategies. The
first analysis, the “full model,” included all four sets of predictors
described previously: pre-matriculation characteristics, study time,
preparation methods, and coaching course participation. Then, for
each set, two regression equations were estimated: the first included
just the predictors in the set, and the second included all predictors
except those in the target set. The R² for the first equation provides
an upper bound on the variance explained by the set; the difference
between the R² for the full model and the R² for the second model
provides a lower bound (the proportion of variance uniquely
predicted by the set). The full model explained 33.6% of the variance
in Step 1 scores. Table 2 provides the upper and lower bounds on
the R² predicted by each set.
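
The bounding logic can be made concrete with a little arithmetic on the values reported in Table 2. The following Python sketch is illustrative only: it takes the R² of each set alone as the upper bound and the unique R² as the lower bound, then backs out the R² that the equation omitting each set would have had. The variable and function names are hypothetical.

```python
# R-squared values as reported in Table 2.
FULL_MODEL_R2 = 0.335

# Upper bound: R2 of the equation containing only the set.
SET_ALONE_R2 = {
    "pre-matriculation": 0.324,
    "study time": 0.008,
    "study methods": 0.035,
    "coaching courses": 0.018,
}

# Lower bound: full-model R2 minus R2 of the equation omitting the set.
UNIQUE_R2 = {
    "pre-matriculation": 0.282,
    "study time": 0.004,
    "study methods": 0.004,
    "coaching courses": 0.001,
}


def implied_r2_without(set_name: str) -> float:
    """R2 of the equation that omits the set, implied by the reported values."""
    return round(FULL_MODEL_R2 - UNIQUE_R2[set_name], 3)


for name in SET_ALONE_R2:
    print(
        f"{name}: upper={SET_ALONE_R2[name]:.3f} "
        f"lower={UNIQUE_R2[name]:.3f} "
        f"R2 without set={implied_r2_without(name):.3f}"
    )
```

Read this way, the three non-academic sets together account for only about .053 of the variance once the pre-matriculation set is removed from the equation.
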

Table 1. Mean USMLE Step 1 Scores by Test-preparation Approach Used

Approach                          Approach Used   Approach Not Used   p Value*

General instructions booklet
  Mean score                      222.91          219.90              .009
  No. students                    851             366
  SD                              18.69           18.43

Lecture notes
  Mean score                      222.64          221.61              .349
  No. students                    468             749
  SD                              18.69           18.84

Note-taking service
  Mean score                      224.12          221.88              .327
  No. students                    67              1150
  SD                              18.70           18.62

Textbooks
  Mean score                      224.90          219.70              <.001
  No. students                    540             677
  SD                              18.28           18.64

Course syllabi
  Mean score                      226.66          220.80              <.001
  No. students                    250             967
  SD                              18.29           18.56

Commercial books about USMLE
  Mean score                      222.22          213.28              .010
  No. students                    1188            29
  SD                              18.43           23.57

School materials about USMLE
  Mean score                      221.43          221.43              .060
  No. students                    304             913
  SD                              18.86           18.52

Study in groups
  Mean score                      221.50          222.18              .576
  No. students                    309             908
  SD                              18.48           18.68

Coaching course
  Mean score                      217.38          223.38              <.001
  No. students                    279             938
  SD                              19.83           18.05

* From independent-samples t-test.
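
For readers who wish to check a row of Table 1, an independent-samples t statistic can be recomputed from the reported means, standard deviations, and group sizes. The Python sketch below is a minimal illustration that assumes the pooled-variance (Student) form of the test, which the article does not specify, and approximates the two-sided p value with the normal distribution, which is reasonable only because the degrees of freedom are large. The function name is hypothetical.

```python
import math


def pooled_t_test(mean1, sd1, n1, mean2, sd2, n2):
    """Student's independent-samples t statistic from summary statistics."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean1 - mean2) / std_err
    # Two-sided p value from the normal approximation to the t distribution
    # (an assumption that is acceptable only for large df).
    p_approx = math.erfc(abs(t) / math.sqrt(2))
    return t, df, p_approx


# Coaching-course row of Table 1: approach used versus approach not used.
t, df, p = pooled_t_test(217.38, 19.83, 279, 223.38, 18.05, 938)
print(f"t({df}) = {t:.2f}, approximate p = {p:.2g}")
```

For the coaching-course row this gives a t statistic of roughly -4.8 on 1,215 degrees of freedom, which is consistent with the reported p < .001.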

Results indicated that pre-matriculation characteristics
accounted for nearly all of the explained variance. Study time
explained a small amount of variance, with both weeks of study and
hours of study per week having a small positive influence on scores.
However, neither coaching courses nor preparation methods had a
significant influence on scores. Further, the term measuring the
interaction between coaching course enrollment and MCAT scores
was not significant.

In order to focus on preparation methods in more detail, we 
examined each individual method separately, controlling for the 
pre-matriculation characteristics. The results showed that use of 
textbooks had a significant effect on Step 1 scores, although the 
effect was fairly small (1.9 points). The other preparation methods
did not significantly affect scores.

Table 2. Regression Results Concerning the Relationships between
USMLE Step 1 Performance and Four Sets of Predictors

Variable Set                              R²      Unique R²

Full model                                .335    -
Pre-matriculation (MCAT, science GPA)     .324    .282*
Study time                                .008    .004*
Study methods                             .035    .004
Coaching courses                          .018    .001

*p < .05.

Discussion

Studies of the possible effects of coaching courses have been
reported for several post-secondary exams, but have been notably
absent for the USMLE; the only studies we could locate were
undertaken in the 1970s at single medical schools for the NBME Part
I exam. In the first of these studies, Scott et al. found significantly
higher scores in coached examinees in only one of the three years
studied. They found that coaching offered greater benefit to
students with lower basic science GPAs (first two years of medical
school coursework) than it did to students with higher basic science
GPAs. Students were surveyed as part of that study, and the vast
majority of students thought the course had been beneficial. Both
the authors and the students surveyed cited the relevance of the
course content to Part I and the organization of the material as the
most valuable features of the course. In contrast, Lewis and Kuske
reported that after controlling for the examinees’ basic science
GPAs, commercial review courses had no detectable effect on
scores.

In the present study, the examinees had used several different
strategies while preparing for the USMLE Step 1. By far the most
common approach was to use commercially prepared study guides.
These had been used by 98% of the survey respondents, indeed by
more respondents than had used the traditional textbooks they are
generally required to buy. The use of commercial study guides also
eclipsed the use of materials prepared or approved by professional
medical school educators, and eclipsed use of the USMLE
publication designed specifically for Step 1 preparation. Ironically, the
results obtained indicate that examinees may benefit by using
standard texts, many of which they have already purchased. There was
little or no evidence of achievement of higher scores as a
consequence of using commercially prepared material, controlling for
pre-matriculation characteristics and other study methods used.

Perhaps the most interesting finding in this analysis is the limited
impact of coaching courses on scores. Several caveats must
accompany this finding. First, students who enroll in coaching courses are
self-selected, in some instances because they are concerned about
their readiness to take Step 1. This self-selected group may
disproportionately include students in academic trouble at their medical
schools and those who have been warned that they are in danger
of failing based on tests given in medical school. Second,
time-intensive courses may compete with other preparation methods that
are more effective or time-efficient. Third, our study lumped
together all coaching courses, which vary in length, intensity, and
teaching methods. With these provisos, our findings suggest that
participation in coaching courses appears to have little effect on
scores when controlling for educational antecedents, time of study,
and other preparation methods.

Last, the reduction in sample size due to survey return rate and
incomplete data was problematic. Examinees included in the study
had slightly higher Step 1 scores (220.7 versus 216.1). Thus, there
is some evidence to support the intuitive reasonableness of
selection bias.

Correspondence: Raj A. Thadani, MA, NBME, 3750 Market Street, Philadelphia, PA
19104; e-mail: <nhadani@mail.nbme.org>.

• YOU’VE GOT MAIL: DISTANCE EDUCATION



Moderator: Penny Jennett, PhD 



Effectiveness of Telehealth for Teaching Specialized Hand-assessment Techniques to
Physical Therapists

WENDY BARDEN, HOWARD M. CLARKE, NANCY L. YOUNG, NANCY McKEE, and GLENN REGEHR



Health care reform has changed the focus of patient care from
primarily inpatient to an increased emphasis on outpatient services.
The reductions in hospital beds, staff complements, and lengths of
inpatient stays have led to an increased need for early referral of
patients to health care professionals in the community.
Unfortunately, a corresponding adjustment of outpatient resources has not
occurred, resulting in an imbalance of resources and making
accessibility to the appropriate services in the community extremely
difficult. This is especially true for outlying communities, since
metropolitan centers have disproportionately large numbers of health
care specialists. In North America, with its vast geographic areas,
travel between the metropolitan centers and the rural communities
is often problematic, creating difficulties for patients needing care
and for rural practitioners, who experience a feeling of professional
isolation.

Telehealth technology may provide an economically feasible
solution to these concerns. Telehealth has been defined as the
utilization of telecommunications technology to provide health care
services and medical information over distance. Telehealth has the
potential to improve services to rural communities by providing
not only direct telemediated access to clinical specialists for
patients, but also the opportunity for the efficient training of rural
professionals in the necessary specialty care.

A broad range of medical specialties has demonstrated the
capabilities of telehealth to assess patients in remote areas. Much of
this research, however, has focused on domains in which visual
and/or auditory information is sufficient for accurate assessment. It
is less clear, however, whether telehealth assessment is equally
effective for specialties where tactile interaction between the patient
and health care professional is considered critical. For these
situations the health care professional at the distant site must be the
“consultant’s hands.”

There is a parallel in using telehealth for the purposes of clinical
education. That is, telehealth may be effective for teaching
knowledge-based topics, but many health profession domains, such as
physical therapy assessment skills, have tactile components that
require measurement and analysis. Training for these types of skills
may challenge the application of telehealth beyond its current
capabilities. The purpose of this research, therefore, was to determine
the effectiveness of telehealth for teaching specialized assessment
skills requiring “hands on” interaction with patients.

Method 

Participants. In 1999, a total of 42 physical therapists from two
Northern Ontario cities agreed to participate. They were stratified
by city and were systematically allocated to one of three
interventions to ensure that the groups were balanced according to age,
graduation year, type of educational format utilized at the university
where the participants trained, prior hand therapy experience, prior
telehealth experience, and type of current clinical practice.

Intervention. Three educational formats were used to teach five
hand-assessment skills: volumetrics of the hand; total active
movement of the index finger; joint mobilization of the proximal
interphalangeal joint of the long finger; grip strength; and two-point
discrimination of the ulnar nerve. The three interventions were
self study (SS); direct face-to-face teaching (DT); and telehealth
teaching (TT). The same information was provided to the
therapists in each of the three formats. However, the manner in which
this information was transmitted differed across the three formats.

The therapists assigned to the SS group were provided with a
package containing written information that they were able to
review over a three-day period. This material outlined how to
correctly perform each hand-assessment skill based on the guidelines
established by the American Society of Hand Therapists. There
were approximately three pages of information per skill, including
history, indications, contraindications, technique, and diagrams
demonstrating performance of the skill. When given the
documentation, these therapists were given instructions to learn the material
independently in the same manner as they would normally.

The DT session involved approximately 3.5 hours of direct
contact with the instructor and was organized such that the instructor
taught each skill for 15 minutes, providing the relevant information
as described above and demonstrating the skill using a standardized
patient. Immediately following the teaching and demonstration of
each skill, the therapists practiced in pairs for approximately 30
minutes using each other as the “patient,” with the expectation
that when they were not interacting with the instructor they would
exchange ideas to solve problems and to perfect their performances.
During the 30-minute practice period each pair also received five
minutes of direct contact and interactive feedback from the
instructor.

The TT session was identical in format and timing to the DT
session. To ensure similarity of presentation, the primary
investigator of the study was present at both teaching sessions. The
primary difference between the DT and TT groups was that the
participants were located together at one local telehealth site and the
same instructor from the DT group was located at a second local
telehealth site that was physically removed from the first.
Participant-instructor interaction was therefore mediated using an
intracity link between two facilities that housed compatible
videoconferencing equipment, thus eliminating the possibility of the direct
“hands-on” contact with the instructor during the interactive
feedback components of the session.

Evaluation Instruments. A modified objective structured clinical
examination format was used for both a pre-test evaluation and a
post-test evaluation. Each participant performed the five skills
consecutively on a single standardized patient, taking up to five
minutes per skill. All five skills were evaluated by the same
examiner (a content expert who was blinded to the intervention
condition), with a separate mark given for each skill. Two evaluation
instruments were used for each skill. First, a five-point global rating
scale with four domains (knowledge of the technique, the ability
to perform the technique, instrument handling, and organizational
skills) was used to assess the underlying characteristics of
performance. Anchors were provided for points 1 (poor, unable to
perform), 3 (adequate), and 5 (excellent performance). The global
score for each skill was calculated as the average of the scores for
the four separate domains. Pilot work on this global scoring
technique confirmed inter-rater reliability (ICC = 0.78 to 0.91 for the
five skills) and construct validity (with skill level, novice versus
intermediate versus expert, accounting for 20% to 67% of the
variation in scores for the five skills). As a second measure of
performance, the examiner completed a binary question addressing
competency for each skill.
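
The global score described above is a simple average of four domain ratings, which the following Python sketch illustrates. The domain keys, the function name, and the example ratings are hypothetical; only the four domains, the 1-to-5 scale, and the averaging rule come from the text.

```python
from statistics import mean

# The four domains rated on the five-point global rating scale.
DOMAINS = ("knowledge", "performance", "instrument_handling", "organization")


def global_score(ratings: dict) -> float:
    """Average the four 1-5 domain ratings into one global score for a skill."""
    for domain in DOMAINS:
        if not 1 <= ratings[domain] <= 5:
            raise ValueError(f"{domain} rating must be between 1 and 5")
    return mean(ratings[domain] for domain in DOMAINS)


# Hypothetical ratings for one participant on one skill.
grip_strength = {
    "knowledge": 4,
    "performance": 3,
    "instrument_handling": 5,
    "organization": 4,
}
print(global_score(grip_strength))
```

A participant would receive five such global scores per testing session, one for each hand-assessment skill.
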

Procedure. The research was conducted over two five-day periods
one month apart in each of two Northern Ontario cities. For each
city, the participating therapists individually attended a pre-test.
These were scheduled for 30 minutes, and all were completed over
a two-day period. Following each participant’s pre-test, the
participant was given instructions relevant to his or her teaching
intervention. Those in the SS group were given the manual with
appropriate instructions and given a time for their post-test session.
Participants in the TT and DT groups were told when and where
to arrive for the instructional session and were given a time for
their post-test session. All participants were asked to avoid
discussing the nature of the test with other participants prior to
completion of the pre-test period, and participants from each group
were asked not to discuss the training material across groups in
order to avoid contamination of the experiment. At each city the
TT session was held in the morning and the DT session was held
in the afternoon of the third day. The post-test was conducted over
the last two days, with the relative time of the post-test for each
participant as close as possible to the relative pre-test time to ensure
almost-identical delays between pre- and post-tests for all
participants.

Results 

Performance Scores. The summary statistics for the performance
scores for all five skills are presented in Table 1. It is clear that the
DT and TT groups approached excellent performance on all five
skills after the intervention, whereas the SS group demonstrated
only adequate performance on three of the five skills and poor
performance for the two remaining skills. For all five skills, the
interaction terms from the two-way ANOVAs suggest significant
differences in the amounts of learning among the groups (F(2,39)
values ranged from 4.98 to 26.65, for all analyses, p < .01). The
subsequent one-way ANOVA comparing the three groups on the
pre-test showed no effect for any of the five skills (F(2,39) values all less
than 1.00, ns), suggesting that all three groups started at the same
skill level. However, the one-way ANOVA comparing the three
groups on the post-test showed powerful, significant effects of the
group (F(2,39) values ranged from 9.96 to 35.06, for all analyses p <
.01), suggesting differences in abilities among the three groups after
the intervention. A series of post-hoc Tukey tests demonstrated no
significant difference between the DT and TT groups but a
significant difference between the SS group and both the DT and TT
groups, suggesting that the members of the DT and TT groups
learned equally well, and learned better than did those in the SS
group. Finally, given the lower post-test scores for the SS group, a
series of paired t-tests was performed on the SS group results to
determine whether self study was a worthwhile intervention.
For only three of the five hand-assessment skills was there a
significant difference in the pre-test and post-test performance (p <
0.05), and the difference scores for these three skills are relatively
small, raising the question of whether the difference is clinically
significant (see Table 1).

Competency Scores. A similar pattern of results occurred with the
competency scores. The pre-test and post-test percent competency
scores are presented in Table 2. Chi-square analyses of the pre-test
competency assessments showed no significant differences between
groups on any of the five skills (χ² ranged from 0.20 to 1.03, ns,
with two being incalculable because all participants were evaluated
as not competent), suggesting that individuals from each group were
equally likely to be competent. By contrast, chi-square analyses of
the post-test results revealed significant effects of group for all
five skills (χ² ranged from 8.42 to 24.21, for all analyses, p <
.01). A series of subsequent chi-square analyses comparing the
methods by pairs again showed no significant difference between the
DT and TT groups, but significant differences between the SS and TT
groups and significant differences between the SS and DT groups for
all five skills. Finally, a series of subsequent McNemar’s tests was
performed on the SS group competency results to determine whether
this intervention was able to change the competency levels of
subjects. For all five skills there was no significant change in
competency levels for the SS group.
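As an illustration of the post-test chi-square comparison, the sketch below rebuilds the volumetrics counts from the Table 2 percentages and computes the Pearson statistic by hand. It rests on assumptions the text does not confirm: group sizes of 16, 13, and 13 are inferred from the percentages (56.3% of 16 is 9), and the authors may have used a different procedure or correction. Several expected counts are below 5, so this is a rough check rather than a recommended analysis.

```python
import math

# Post-test volumetrics counts: (competent, not competent).
# ASSUMPTION: counts are reconstructed from Table 2 percentages using
# inferred group sizes of 16, 13, and 13.
observed = {
    "self_study": (9, 7),
    "direct": (13, 0),
    "telehealth": (13, 0),
}


def pearson_chi_square(rows):
    """Pearson chi-square statistic for a table given as a list of rows."""
    rows = list(rows)
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for row, row_total in zip(rows, row_totals):
        for count, col_total in zip(row, col_totals):
            expected = row_total * col_total / grand_total
            statistic += (count - expected) ** 2 / expected
    return statistic


chi2 = pearson_chi_square(observed.values())
# A 3 x 2 table has 2 degrees of freedom, and for 2 degrees of freedom
# the chi-square upper-tail probability is exactly exp(-x / 2).
p_value = math.exp(-chi2 / 2)
print(f"chi-square(2) = {chi2:.2f}, p = {p_value:.4f}")
```

With these reconstructed counts the statistic lies within the range the paper reports for the post-test and the p value is below .01.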

Discussion 

The primary purpose of this study was to determine whether
telehealth could be utilized to effectively teach specialized
assessment skills to physical therapists. This study has
demonstrated that telehealth may be used in this capacity.

The five hand-assessment and treatment techniques that were selected
for this study all possessed components that would challenge the
transmission capabilities of telehealth. Three of the skills
(volumetrics, total active movement, and grip strength) required the
participants to use primarily visual learning skills. All elements of
these three skills are easily learned by watching a demonstration or
studying written material. Therefore, it was of no great surprise
that these three skills were successfully taught via telehealth. What
was somewhat surprising were the low pre-test performance and
competency scores for the grip-strength technique, since this skill
is simple and frequently used in many areas of physical therapy.
However, the low scores are easily explained by the strict guidelines
set by the American Society of Hand Therapists that were used during
the evaluation of the participating therapists.

The two remaining skills, joint mobilization and two-point
discrimination, are not strictly visual but are skills that require
tactile input. Initially, there was a concern about how tactile
feelings could be transmitted via telehealth. With clear, concise
instructions and appropriate camera placement it was demonstrated
that these two skills could be learned.

In this study, it was demonstrated that telehealth teaching, when
compared with the conventional teaching model of direct face-to-face
teaching, resulted in no statistically significant difference between
the performance scores for any of the five skills taught. However,
when compared with self-study, there were statistically significant
differences in the performance scores, suggesting that the telehealth
group learned more. Both of these results suggest that telehealth may
be used as effectively as the conventional method and more
effectively than self-study to teach these five assessment skills.

In examining the competency scores for the telehealth and the direct,
face-to-face groups there was once again no statistically significant
difference between the groups at baseline. Differences in the
competency levels were determined, however, after the educational
intervention, indicating that the groups had become more competent in
all five skills. When the telehealth group was compared with the
self-study group, there were statistically significant differences
between the groups’ competency scores for all five skills.

The results of this study must be interpreted with some caution. We
did not, for example, ask the participants in the self-study group
what they had done to prepare for the post-test. Thus, although we
asked them to do what they would normally do if a patient being
referred required that technique, we do not know what the
participants in the SS group actually did or the length of time they
might have spent preparing relative to the time spent in the formal
intervention groups. It is unlikely that they spontaneously
practiced, and even more unlikely that they sought external feedback
for their practice, two components of the formal training programs
that were likely very important. Further, we did not ask them what
they would normally do in these circumstances, so without further
study we cannot say whether the SS group’s performance is
representative of normal practice.

Similarly, we do not know the extent of contamination between the
groups. Although the participants were specifically asked not to
interact between groups, we did not subsequently determine the extent
to which they had followed these instructions. This might limit the
validity of the findings, although it is worth noting that the group
that had more motivation to violate this injunction to speak to
others continued to have lower scores.

Table 1. Mean Pre- and Post-test Scores of 42 Physical Therapists (in Three Groups) on Five Skills, University of Toronto, 1999*

Skill                     Group                 Pre-test      Post-test      Difference
                                                Mean (SD)     Mean (SD)

Volumetrics               Self study            2.23 (1.23)   3.19 (1.25)    0.95 (1.61)‡
                          Direct teaching       2.17 (1.42)   4.63 (0.46)†   2.46 (1.36)
                          Telehealth teaching   2.23 (1.36)   4.50 (0.40)†   2.27 (1.13)
Total active movement     Self study            1.45 (0.40)   2.23 (1.08)    0.78 (1.21)‡
                          Direct teaching       1.38 (0.51)   4.33 (0.75)†   2.94 (0.65)
                          Telehealth teaching   1.58 (0.68)   4.65 (0.59)†   3.08 (0.87)
Joint mobilization        Self study            2.98 (1.69)   3.66 (1.33)    0.67 (1.29)§
                          Direct teaching       2.62 (1.41)   4.92 (0.28)†   2.31 (1.46)
                          Telehealth teaching   2.69 (1.60)   4.85 (0.42)†   2.15 (0.51)
Grip strength             Self study            2.92 (1.22)   3.06 (0.84)    0.14 (1.52)§
                          Direct teaching       3.06 (1.21)   4.69 (0.23)†   1.63 (1.14)
                          Telehealth teaching   3.13 (1.24)   4.35 (1.13)†   1.21 (1.23)
Two-point discrimination  Self study            1.11 (0.26)   1.84 (1.03)    0.73 (1.05)‡
                          Direct teaching       1.13 (0.22)   4.10 (0.79)†   2.96 (0.83)
                          Telehealth teaching   1.12 (0.35)   4.04 (0.89)†   2.92 (0.93)

* Scores were on a global rating scale (1 = poor, unable to perform; 3 = adequate knowledge; 5 = excellent performance). See text for description of the three groups and the pre- and post-tests.
† Significantly different from self study on post-test by post-hoc Tukey test (p < 0.05).
‡ Significant improvement from pre- to post-test using paired t-test (p < 0.05).
§ No significant improvement from pre- to post-test using paired t-test.

Despite these potential limitations, the current study gives us great
hope for the use of the telehealth medium for teaching not only
technical information but also technical skills. Establishing
telehealth as an effective teaching tool provides a method of
continuing education to community health care professionals who need
to perform these types of technical skills. Therefore, all
professionals (nurses, therapists, doctors) would benefit from this
technology, allowing increasingly early referral of complex cases to
the community for ongoing rehabilitation. If telehealth is utilized
to transmit and teach the required information, continuity of
specialized care will be maintained with support provided to the
community practitioner. Perhaps teaching of all assessment skills
will not be possible, but telehealth will continue to provide a rich
communication link between the acute care facilities and the
community.



Table 2. Percentages of 42 Physical Therapists Identified as “Competent” in Each of Three Groups, on Five Skills, University of Toronto, 1999*

Skill                     Group                 % after     % after      Change
                                                Pre-test    Post-test

Volumetrics               Self study            37.5        56.3          18.8‡
                          Direct teaching       30.8        100.0†        69.2
                          Telehealth teaching   30.8        100.0†        69.2
Total active movement     Self study             0.0        12.5          12.5‡
                          Direct teaching        0.0        76.9†         76.9
                          Telehealth teaching    0.0        92.3†         92.3
Joint mobilization        Self study            56.3        75.0          18.8‡
                          Direct teaching       53.8        100.0†        46.2
                          Telehealth teaching   38.5        100.0†        61.5
Grip strength             Self study            68.8        50.0         -18.8‡
                          Direct teaching       61.5        100.0†        38.5
                          Telehealth teaching   61.5        84.6†         23.1
Two-point discrimination  Self study             0.0         6.3           6.3‡
                          Direct teaching        0.0        76.9†         76.9
                          Telehealth teaching    0.0        76.9†         76.9

* See text for description of groups and the pre- and post-tests.
† Significantly different from self study on post-test by chi-square (p < .05).
‡ No significant improvement from pre- to post-test using McNemar test.
• YOU’VE GOT MAIL: DISTANCE EDUCATION

Moderator: Penny Jennett, PhD

A Controlled Trial of an Interactive, Web-based Virtual Reality Program for Teaching
Physical Diagnosis Skills to Medical Students

JULIE A. GRUNDMAN, ROBERT S. WIGTON, and DEVIN NICKOL 



Multimedia instruction offers many potential advantages over tra- 
ditional methods of instruction. Multimedia programs can interact 
with the learner, use graphic images, sound, and video, and keep 
track of progress. Students complete programs at their own pace 
while accessing material both at school and at home. Multimedia 
instruction can provide an interactive alternative to lectures and 
textbooks by quizzing the student over concepts as they are pre- 
sented and requiring that the student think about the material be- 
fore proceeding. 

While several studies have found that multimedia instruction can be
more efficient by reducing instructor and classroom time, few have
been able to show an increase in learning when compared with
traditional methods of instruction. Santer and colleagues compared a
multimedia textbook with a lecture presentation on the same material
and found an increase in the post-test scores of the multimedia
group, but no difference when they compared the multimedia group with
a group using a printed textbook. Studies comparing multimedia and
traditional approaches to learning in the areas of psychology and
computer science instruction suggest an improvement in students’
performances using the multimedia versions. Thus, there is a need for
well-designed studies to determine whether multimedia instruction
more effectively facilitates students’ learning, including medical
students’ learning, than do traditional methods.

Multimedia instruction is particularly well suited to help students
learn physical diagnosis. Sound, pictures, and movies augment the
learning of examination skills and diagnosis findings by allowing
students to hear heart and lung sounds, watch videos of physical
examination procedures, and see more pictures of pathologic findings
than can be included in a textbook or lecture. These visual and audio
aids should increase students’ recognition of these findings when
encountered in patients.

We wished to test whether a Web-based multimedia program using
interactive learning and virtual reality would be more efficient and
effective than traditional print-based self-study by medical
students. To accomplish this, we designed a course on physical
examination of the eye and ear. Using this material, we conducted a
controlled study of first-year medical students to determine whether
having students use a multimedia version of the course resulted in a
change in the time spent with the material and an increase in
knowledge gained when compared with having students study a printed
version of the same material.

Method 

Participants. All 126 first-year medical students at the University
of Nebraska Medical Center College of Medicine were invited to
participate in this study in 1999; of these, 121 volunteered. We
obtained Institutional Review Board approval and participants gave
informed consent.

Multimedia Course. We designed two courses, one about the eye and the
other about the ear, to help first-year medical students learn
physical diagnosis skills and tests. To minimize contamination, the
subject matter chosen represented important clinical skills that
ordinarily would not be presented at this time and was not part of
the regular first-year curriculum. Besides otoscopy and funduscopy,
topics included the interpretation of audiograms and tympanograms, as
well as recognition of acute otitis media, serous otitis media,
papilledema, glaucoma, and diabetic retinopathy. Some of the topics
were chosen because we predicted they could be more effectively
taught with multimedia presentations. We estimate the course took 300
hours to create at a cost of $300 in materials and $3,000 in student
labor.

After developing the courses, we created both a multimedia and a
printed version of the course materials. The multimedia version
emphasized interactivity and included many pictures and QuickTime
virtual reality (QTVR) movies of the funduscopic examinations and
otoscopy. These movies showed normal and pathologic views of the
retina and tympanic membrane as they would be seen through an
ophthalmoscope or otoscope. Students were asked to scan around the
entire picture searching for pathology, just as they would when
examining actual patients. In addition, the multimedia version led
students on a defined path through the program by presenting a
concept on each page and then requiring that the student respond to a
question about the concept before proceeding. Correct answers allowed
the student to proceed while students giving incorrect answers were
provided with immediate feedback before continuing. The multimedia
version differed from the printed version in that it included
interactive questions and contained more color pictures.

Study Design. The students took a pre-test before the courses and a
post-test afterwards. The pre-test consisted of 20 multiple-choice
questions about physical examination skills and findings related to
the eye and ear. The class had already been divided into 12 small
study sections. In order to provide two groups matched with regard to
their knowledge of the subject matter, we divided the existing study
sections into two groups matched on average pre-test score. Group A
(n = 60) used the printed manual for the ear material and the
multimedia program for the eye material. Group B (n = 61) did the
reverse.

Students could access the materials one week before the post-test. We
used three methods to track the time spent. The computer logged
access and time spent with the multimedia material. To use the
printed version, students had to check it in and out of the library
reference desk, creating a log of the time spent. In addition, we
asked the students to keep their own records of the time spent. This
was reported on a post-study questionnaire. As an incentive, the
student with the top score in each of the two groups was awarded a
$25 cash prize.

The post-test was given in two sections. The first portion consisted
of ten multiple-choice questions presented on the computer and
included questions regarding virtual-reality simulations of
funduscopy and otoscopy. The second portion of the post-test, given
on the following day, consisted of 40 multiple-choice questions, 20
on the eye and 20 on the ear. These questions included general
fact-based questions, images, and case studies. All questions and
images were different from those used in either the written or the
multimedia version.

Evaluation and Analysis. The level of the students’ acceptance of the
program was evaluated with a written survey. The students were asked
which instructional method they preferred, how much time they had
spent, and how they rated their learning using each method.
Table 1. Post-test Scores and Time Spent by Instruction Method, University of Nebraska College of Medicine, 1999*

                   Multimedia Method                            Printed-version Method
                   Questions Correct           Mean             Questions Correct           Mean
Skill    Group     Mean    %     95% CI†       No. Min.  Group  Mean    %     95% CI†       No. Min.

Eye      A         15.9    64    15.0-16.8     39.9      B      13.4    54    12.6-14.2     31.0
Ear      B         16.1    64    15.3-16.9     48.6      A      14.5    58    13.6-15.4     36.0

* A total of 121 first-year medical students (60 in Group A, 61 in Group B) took the post-test to examine their skills in topics related to physical diagnosis skills and tests concerning the human eye and ear. All comparisons between multimedia and printed version are significant at p < .02 (MANOVA).
† Confidence interval.



Post-test scores were analyzed using multiple regression and ANCOVA,
controlling for time spent on the multimedia or printed versions,
section, pre-test scores, scores on the Medical College Admission
Test (MCAT), and college grade-point average (GPA).

Results 

To ensure that the two groups were equivalent, we compared them with
regard to their members’ MCAT sections’ scores, undergraduate GPAs,
and mean pre-test scores. The pre-test scores were similar, with
group A correctly answering 8.65 of 20 (43.3%) and group B correctly
answering 8.52 of 20 (42.6%). There was no significant difference
between the groups’ mean Verbal Reasoning MCAT scores, Physical
Sciences MCAT scores, Biological Science MCAT scores, total
undergraduate GPAs, and pre-test scores. Reliability for the
paper-based portion of the post-test was 0.69 using the
Kuder-Richardson formula 20.
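For readers unfamiliar with the Kuder-Richardson formula 20, the sketch below shows how the coefficient is computed from dichotomously scored items: KR-20 = (k / (k - 1)) * (1 - Σ p·q / σ²), where k is the number of items, p is the proportion answering an item correctly, q = 1 - p, and σ² is the variance of total scores. The response matrix is a hypothetical toy example, not data from this study, and the sketch assumes the population-variance convention (some texts use the sample variance instead).

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson formula 20 for rows of 0/1 item scores."""
    n_people = len(responses)
    n_items = len(responses[0])
    # Sum over items of p * q, where p is the proportion correct.
    pq_sum = 0.0
    for item in range(n_items):
        p = sum(person[item] for person in responses) / n_people
        pq_sum += p * (1 - p)
    # Variance of each examinee's total score (population convention).
    total_variance = pvariance([sum(person) for person in responses])
    return (n_items / (n_items - 1)) * (1 - pq_sum / total_variance)


# HYPOTHETICAL toy data: five examinees, four items (1 = correct).
toy_responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
reliability = kr20(toy_responses)
print(f"KR-20 = {reliability:.2f}")
```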

The students who used the multimedia version of the eye or ear course
scored higher on the respective post-tests than did the students who
used the printed version, when compared using univariate analysis
(see Table 1). This higher achievement in the multimedia version
persisted when pre-test scores, MCAT scores, and undergraduate GPAs
were controlled for by entering those variables into multivariate
analysis. Multiple regression analysis showed that both use of the
multimedia version and the time spent were predictors of the eye
course’s post-test score (r² = 0.24). Only the use of the multimedia
version (and not time spent) predicted the ear course’s post-test
score (r² = 0.11). Interestingly, the students who used the
multimedia version did not score higher on the subset of questions
dealing specifically with the virtual-reality simulations or on the
computer-based subset.
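The univariate comparison can be approximated from the means and confidence intervals in Table 1. The sketch below is a back-of-the-envelope check, not the authors' MANOVA: it assumes the intervals were built as mean ± 1.96 standard errors, treats the two groups as independent, and uses a normal approximation for the p value.

```python
import math

# (mean, CI lower, CI upper) for number of questions correct, Table 1.
scores = {
    "eye": {"multimedia": (15.9, 15.0, 16.8), "printed": (13.4, 12.6, 14.2)},
    "ear": {"multimedia": (16.1, 15.3, 16.9), "printed": (14.5, 13.6, 15.4)},
}
Z_95 = 1.96  # ASSUMPTION: intervals are mean +/- 1.96 standard errors


def se_from_ci(lower, upper):
    """Recover a standard error from a symmetric 95% interval."""
    return (upper - lower) / (2 * Z_95)


def z_for_difference(first, second):
    """z statistic for the difference between two independent means."""
    mean_1, lower_1, upper_1 = first
    mean_2, lower_2, upper_2 = second
    se_diff = math.hypot(se_from_ci(lower_1, upper_1),
                         se_from_ci(lower_2, upper_2))
    return (mean_1 - mean_2) / se_diff


def two_sided_p(z):
    """Two-sided p value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


p_values = {}
for course, versions in scores.items():
    z = z_for_difference(versions["multimedia"], versions["printed"])
    p_values[course] = two_sided_p(z)
    print(f"{course}: z = {z:.2f}, p = {p_values[course]:.4f}")
```

Under these assumptions both differences remain below the p < .02 threshold given in the table footnote.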

Students using the multimedia version spent more time on the material
than did those using the printed version (see Table 1). However, an
increase in time could have resulted in an increase in the post-test
score. With this possibility in mind, we found that when time and
pre-test score were controlled for using ANCOVA, students using the
multimedia version still performed better than did those using the
printed manual (p < .001). The only correlation between time and
post-test score occurred with the multimedia version of the eye
information (r² = 0.61, p < .0001).

The results of the written survey showed that 78% of the students
preferred the multimedia version to the printed version and were
interested in using similar programs for other areas of physical
diagnosis. Most students indicated that their learning had been more
effective using the multimedia version, and stated that they had
enjoyed the virtual-reality movies and interactivity. The students
did not report any difficulty in accessing either the computer or the
written version of the program.

Discussion 

As described above, we found that students using an interactive
multimedia program improved more than did those using a printed
manual with the same content. The students spent more time using the
multimedia version but also improved more, given the time spent.
These findings indicate that the multimedia program was more
effective.

Which aspects of the program led to this increase in post-test score?
Results showed that the increased time accounted for some, but not
all, of the gain. The gain in knowledge could have been due to the
increased time that students spent on the multimedia version, but
this did not appear to be the case, since even after taking time into
account, the multimedia program still showed a greater improvement.
The virtual-reality movies may have contributed, but our results
showed improvement in all parts of the post-test, not just those
dealing with virtual reality. The interactive dialog probably played
a large role in the program’s effectiveness by encouraging the
students to work through problems, inducing them to take more time on
particular tasks and probably to give more attention to the material.
The computer itself could have affected the results, but there was no
indication that this was the case, since the multimedia group had a
higher post-test score but not a higher score on only the
computer-based questions.

There are several limitations to the generalization of these results.
We have not yet measured the long-term retention of the information,
or whether the students who used the multimedia version perform
better when examining patients. The latter activity may be best
assessed using a performance-based assessment tool, such as an
objective structured clinical examination. The multimedia version had
more illustrations, but the results suggest this was not a major
cause of the differences in achievement, since the scores were not
better on the computer part or on the few questions that involved
pictures or feature recognition. The ability to include more pictures
is an inherent advantage of multimedia programs over textbooks, where
the cost of color pictures generally precludes their inclusion.
Finally, we studied first-year students at only one medical school,
so the generalizability of these results in other settings and other
courses should be confirmed.

In our study, we selected two specific content areas that we thought
were well suited to multimedia. But other areas of physical diagnosis
may show a similar benefit when using multimedia as a learning
resource, such as having the ability to listen to heart and lung
sounds. Studies have already shown this to be of benefit in teaching
cardiac auscultatory skills.

Is multimedia instruction more efficient in the medical school
setting? Multimedia programs can be reused and offer flexible
scheduling, but complex programs may take considerably more time to
develop. Lyon found that a multimedia program reduced instructor time
with no loss of student achievement. In our study, the students spent
more time with the multimedia version, and the time spent resulted in
greater achievement, but this version required greater development
time. Our experience suggests that whether a multimedia program
results in gains in efficacy may depend on the nature of the subject
and the learning mode it replaces.
Conclusion

Our study suggests that multimedia learning, incorporating
interactivity and virtual reality, is more effective than traditional
approaches to teaching the eye and ear sections of the physical
examination. Learning was enhanced in all areas, not just those
dealing with virtual reality or the multimedia. The implications are
that more multimedia courses in physical diagnosis techniques should
be developed and evaluated. Further study is needed to determine what
aspects of multimedia learning are most effective and how well the
results found here will apply to other areas of the curriculum.

Correspondence: Robert S. Wigton, MS, MD, Section of General Internal Medicine,
University of Nebraska Medical Center, 964285 Nebraska Medical Center, Omaha,
NE 68198-4285; e-mail: (Wigton@unmc.edu).

Evaluation of a CME Problem-based Learning Internet Discussion

JOAN M. SARGEANT, R. ALLAN PURDY, MICHAEL J. ALLEN, SHAILESH NADKARNI, LINDA WATTON,
and PEARL O’BRIEN



Research into the impact of continuing medical education (CME)
demonstrates that effective interventions include practice-enabling
or reinforcing strategies, sequential activities, and/or a high
degree of interaction among participants. Problem-based learning
(PBL), a strategy used in CME, engages participants in small-group
interactive learning, creating a context that reflects the practice
setting by presenting actual cases as problems to be solved. PBL
specifically has been shown to be an effective learning strategy in
CME.

Traditionally, PBL participants have been required to be in the same
place at the same time, but now the Internet enables interpersonal
interaction that is independent of time and place. Using asynchronous
(delayed) interaction via a bulletin board, learners in different
locations can participate in on-line discussions at times convenient
for them. For physicians this removes barriers (e.g., geographic
location, practice responsibilities) to participating in conventional
CME programs and interacting with fellow learners.

Although the Internet provides many opportunities for medical
education, a recent search of the medical literature revealed few
studies of its use for interpersonal interaction in medical
education, and only one study of on-line PBL. In that study, Chan
attempted to determine the effectiveness of an on-line PBL program in
a randomised controlled trial of 23 physicians. Group process,
however, was not the focus of study, and the number of messages was
small (35 over two months).

Barrows advocates that successful PBL requires facilitators to
perform four functions: navigating (guiding the group through the
activities), facilitating (maintaining a constructive group process),
questioning (using questions to deepen understanding), and diagnosing
(monitoring learners’ progress). Berge and Collins suggest
facilitator roles for general on-line discussions, which include
pedagogic (ensuring the educational task is accomplished), social
(creating a friendly environment), managerial (administrating
organizational elements), and technical (ensuring comfort with the
technology) roles.

The purposes of the present study of an Internet (on-line) CME PBL
discussion were (1) to describe the roles of facilitators, both
on-line and off-line, that enable on-line discussion; (2) to
determine factors that influence learners’ participation in the
on-line discussion; and (3) to determine learners’ satisfaction with
the on-line discussion. The study was a process evaluation, which
documents and assesses the implementation of a program’s activities
to guide further program planning.

Method 

Family practitioners in Nova Scotia comprised the target population,
but other physicians could register. We recruited participants by
advertising the program locally and demonstrating it at a provincial
CME event. The intervention, carried out in 1999, was an on-line
case-based learning module on medication-induced headache (MIH)
developed by a neurologist for a conventional PBL CME workshop and
modified for Internet presentation. We chose this program because the
original workshop was successful and the neurologist is an expert PBL
facilitator interested in Internet learning.

We used Web-CT educational courseware for the module. Besides the
case-based discussion in the bulletin board, the module included a
“lecture,” a quiz, a glossary, and references. We encouraged learners
to review the lecture before joining the discussion. To meet the
College of Family Physicians of Canada accreditation criteria for
on-line CME programs (i.e., that the program be available for a
defined time period and provide the opportunity for physician
interaction), the module was available for one month and participants
were required to post at least one message in the bulletin board.

Using Berge and Collins’ facilitator roles, we outlined two general
roles for the facilitators of the on-line PBL discussion. These were
(1) the pedagogic, or content, role, assumed by the neurologist or
content facilitator, and (2) a combined social (creating a supportive
environment), managerial, and technical role, assumed by two
educators. A graduate student familiar with Web-CT also provided
technical support.

We collected data using the Web-CT electronic activity record, the
program evaluation questionnaire, facilitators’ records of on-line
and off-line activities, bulletin board discussion transcript, a log
of technical problems, and interviews with registrants who did not
participate. The questionnaire was designed to evaluate all
components of the on-line program and consisted of 51 closed-ended
and seven open-ended questions. For this study, we used the nine
closed-ended and one open-ended questions that addressed the case and
bulletin-board discussion, and three closed-ended and one open-ended
question that addressed the general usefulness of the module.
Participants completed the questionnaire electronically or on paper.

Evaluation questionnaires received electronically were automatically
entered into the Web-CT database, which computes descriptive
statistics. We manually entered evaluations received by paper. For
the bulletin board discussion transcript, we used content analysis to
categorize data and identify themes.

Results 

The 31 registrants were 28 family physicians, two family medicine
residents, and one neurologist. The electronic activity record showed
that 12 registrants did not participate. Of these, three did not log
into the program, and nine accessed the home page only. We attempted
to contact these 12 and received responses from four. Two were unable
to log on and had not contacted the “help line.” One reported
personal computer failure and another had become “too busy.” Of the
19 who accessed the MIH module content, 14 participated in the
on-line case discussion.

Fifteen of the 19 (79%) participants who accessed the module
completed the evaluation questionnaire. These included the 14 who
posted messages and one who read bulletin board messages but did not
post any. List 1 summarizes their demographic and computer usage
data.

Table 1 summarizes the same 15 respondents’ ratings of items
addressing the case discussion and the overall course. Items the
learners rated most highly included relevance of the content to their
practices and the prompt response of the content facilitator to their
messages. One learner commented on how the content facilitator
responded, “Dr. P. made sure no one felt stupid about asking a
question, which is very important.” They rated items addressing the
bulletin board the lowest. Related comments included, “A very
frustrating experience because my computer skills were not advanced
enough,” and “Takes a while to get used to the bulletin board.”
Thirteen of the 15 respondents indicated that they wished to have
more discussion-based on-line modules and would recommend this module
to their peers. Supporting comments included, “The bulletin board was
great once you got in” and “I think CME on-line will prove to be a
Godsend for us rural physicians.”

List 1. Respondents’ Demographic and Computer-use Data, Dalhousie University, 1999*

Sex
    Men                  6
    Women                8
    Not noted            1
Years in practice
    <5 years             4
    6-10 years           1
    11-20 years          6
    21-30 years          3
Rating of computer skills
    Beginner             7
    Average              6
    Expert               1
What attracted you to take this course?
    Interest in headache 1
    New technology       9
    CME credit           0
    Convenience          5
How many times did you go into the module?
    Once                 2
    Twice                1
    3 times              1
    >3 times             9
On average, how long did you spend each time you were in the module?
    <30 min              3
    30-60 min            8
    1-2 hours            2
    >2 hours             2

* Not all of the 15 respondents to the evaluation questionnaire described in this article answered all the questions concerning demographics and computer use.

The bulletin board discussion transcript included 122 messages. 
The content facilitator posted 46 messages; the educator facilita- 
tors, 23; and the 14 learners, a total of 53. The numbers of messages 
posted per learner ranged from one to seven, with an average of 
3.8 messages per learner. The learners posted most messages in the 
last two weeks of the program. Most entered the bulletin board to 
post messages only once, although some wrote more than one mes- 
sage at that time. They interacted with the case and facilitators but 
rarely with each other. 

Analysis of the bulletin board transcript revealed four themes. 
These were: content (discussion of the case and questions, 80 mes- 
sages), facilitative (supportive and encouraging comments, 22 mes- 
sages), introductory (personal introductions, 16 messages), and ad- 
ministrative/technical (related to technical or logistic issues, four 
messages). The content expert and the learners posted all the con- 
tent messages and the educators posted 16 of the 22 facilitative 
messages. 

The content facilitator accessed the bulletin board about every 
second day and responded to each new learner message, giving 
positive feedback and stimulating critical thinking. He spent a total 
of about 90 minutes on-line per week. The educator facilitators 
accessed the bulletin board on alternate days to welcome and en- 
courage learners and note problems. They also spent a total of about 
90 minutes on-line per week. Off-line facilitator activities included 
contacting registrants to encourage participation, monitoring prog- 
ress, and resolving problems. The content facilitator spent about 
30 minutes per week in off-line activities, and the educators, about 
five hours per week. In addition, the facilitators undertook pre- 
course activities to encourage participation. These included faxing 
participants a welcome letter, instructions, and help line informa- 
tion; conducting a teleconference to explain the on-line process; 
and posting a welcome message in the bulletin board. 

Learners reported five technical problems to the help line. Four 
reported difficulty accessing the Web site, and one could not post 
a message in the bulletin board. Staff responded as promptly as 
possible and resolved each problem. 

Discussion 

Analysis of the on-line discussion confirmed that the anticipated 
facilitator roles were fulfilled. As content facilitator, the neurologist 
increased the depth and breadth of the content discussion, and the 
educator facilitators performed a social and supportive role by wel- 
coming and encouraging learners. However, the neurologist, 
through his supportive style and prompt responses, also fulfilled a 
social role, and, in fact, may not have needed the assistance of the 
educator facilitators. A program in which the content expert is less 
skilled in PBL facilitation may benefit more by the addition of a 
skilled facilitator. Encouraging participation was an important role, 
expanding to off-line activities and requiring more time than an- 
ticipated. The on-line administrative/technical role was small, but 
it was a critical off-line function. 

Despite these roles there were deficiencies in the PBL discussion. 
Equal learner participation is a goal of PBL,*^ but because most 
learners entered the discussion in the final week or two and often 



Table 1. Mean Ratings of Items Addressing Case Discussion in the 
Bulletin Board and Overall Course by 15 Physician Participants in 
a CME Online PBL Program, Dalhousie University, 1999* 

Item                                                    Mean (SD) 

The case content was applicable to my practice.         4.4 (0.7) 
The case stimulated my thinking about patients in 
  my practice.                                          3.8 (0.9) 
The questions in the case clarified my 
  understanding of the content.                         3.7 (0.6) 
The bulletin board was useful.                          3.5 (1.2) 
I received enough instruction in the use of 
  the bulletin board.                                   3.6 (1.0) 
I felt comfortable participating in the bulletin 
  board.                                                3.6 (1.4) 
Participating in discussions enhanced my 
  understanding of the subject.                         3.8 (0.9) 
Discussions added value to the module.                  3.8 (1.0) 
The instructor responded promptly to my 
  questions.                                            4.1 (0.7) 
The on-line case-based format is an 
  effective learning method for me.                     3.8 (0.9) 

                                                 No. Saying   No. Saying 
                                                    Yes           No 

Based on this experience I would like to do 
  more on-line modules.                              13            2 
Based on this experience I would 
  recommend the module to my peers.                  13            2 

* Rating scale: 1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree, 5 = strongly 
agree. 
did not respond to messages, facilitating an ongoing discussion was 
difficult. Also, as few learners interacted with each other, the dis- 
cussion was teacher-centered as opposed to learner-centered. A 
contributing factor may have been the requirement that learners 
post only one message in the bulletin board to receive credit. In- 
teraction in future modules may be improved by requesting that 
each participant post messages weekly and respond to co-partici- 
pants. 

Barriers to health care professionals’ adopting new communica- 
tions technologies are numerous, and include the lack of adequate 
technical, economic, organizational and behavioral knowledge. 
Lessening these barriers requires intensive learning strategies.’*’ Par- 
ticipating in an Internet discussion requires physicians to both 
adopt a new technology and change their learning behaviors. 
When asked what had attracted them to taking this course, nine 
of the 15 participants indicated “the opportunity to use new tech- 
nology,” but, with respect to their technical knowledge, seven par- 
ticipants rated their computer skills as “beginner.” Although we 
provided printed instructions, a help line, and off-line support by 
educational facilitators, at least two registrants did not participate 
in this program and another two would not participate in future 
programs because of technology-related issues. Our findings rein- 
force the need for educational software that can be easily used by 
learners who may lack computer proficiency and have little time 
for or interest in mastering new technology. Providing more exten- 
sive training may increase participation, but scheduling this for busy 
physicians whose time is limited is difficult. 

This study had several limitations. The study population was 
small and the learners chose to participate, so it may not represent 
a larger physician group. Generalizability is also limited by the lack 
of a control group, and although the evaluation questionnaire dem- 
onstrated face validity through a pilot-testing process, we did not 
test it for reliability. Of the 31 registrants, only 19 (62%) partici- 
pated in the program. We learned the reasons for non-participation, 
important data for this study, of only four of the 12 non-partici- 
pants. Ensuring reliability of the tool before repeating the study, 
replicating it with other populations, and being more aggressive in 
contacting non-participant registrants would strengthen future sim- 
ilar studies. In spite of limitations, this study provides insight into 
facilitators’ roles in on-line PBL discussions and factors influencing 
learners' participation. It supports the view that on-line facilitators 
perform several roles on-line and off-line, and suggests that a chal- 
lenge for facilitating PBL discussions is to promote ongoing 
learner-learner interaction as opposed to one-time learner-teacher 
interaction. Current technology hinders participation, while 
prompt and supportive responses by facilitators to learners’ messages 
encourage it. All but two of the 15 learners completing the eval- 
uation said that they would like to have more modules, indicating 
that the benefits outweighed the disadvantages. 

Placing this small study within the context of physicians’ learn- 
ing, technology adoption, and behavioral change assists in consid- 
ering its implications. PBL is an effective CME learning method 
that uses participant interaction. The Internet is a powerful tool 
that removes traditional barriers to both physicians’ participation 
in CME and their interaction with co-participants, but it creates 
new barriers related to technology and behavioral change. We need 
to learn ways to overcome these barriers, a task that may become 
easier as communication technologies and software applications im- 
prove, and as physicians entering the workforce become more ex- 
perienced in using computers. 
• PLENARY - OUTSTANDING RESEARCH PAPERS 



Moderator: James O. Woolliscroft, MD 



Correlates of Physicians’ Endorsement of the Legalization of Physician-assisted Suicide 

KAREN D. NOVIELLI, MOHAMMADREZA HOJAT, THOMAS J. NASCA, JAMES B. ERDMANN, 

and J. JON VELOSKI 



Although most physicians recognize a duty to provide compassion- 
ate end-of-life care, they often feel ill prepared to do so. Of partic- 
ular controversy is physician-assisted suicide. Physician-assisted su- 
icide is commonly defined as the practice of providing a competent 
patient with a prescription for medication for the patient to use 
with the primary intention of ending his or her own life. In a recent 
survey of approximately 2,000 U.S. physicians, 3.3% reported that 
they had written at least one prescription to hasten death.^ Eleven 
percent reported they would write a prescription to hasten death if 
requested to do so under the current legal system. If legalized, 36% 
of the physicians would be willing to write a prescription to hasten 
death.' 

Consistent with the diversity of physicians’ opinions about the 
practice of assisted suicide, attitudes toward its legalization are also 
divided. When physicians in Michigan were asked to choose be- 
tween legalizing or banning assisted suicide, 56% favored legalizing 
it, while 37% voted for a specific ban.' 

Several studies have examined the demographic correlates of 
physicians’ attitudes towards assisted suicide. Although age and sex 
were unrelated to opinions about assisted suicide,* race was related. 
Furthermore, physicians’ and patients’ preferences for particular ap- 
proaches to end-of-life care followed similar racial patterns. White 
physicians were more likely than African American physicians to 
endorse assisted suicide in terminal care scenarios.* Catholic and 
devoutly religious physicians were also less likely than others to 
endorse it.''’^ 

Physicians’ specialties may also help explain these differences of 
opinion. Oncologists were more likely to oppose assisted suicide.^^’' 
Similarly, support was higher among psychiatrists than among 
emergency medicine physicians."' '*’^ Only one study investigated the 
rationales for physicians’ views on assisted suicide. One third of 
physicians in this study felt that it was immoral, 34% felt that it 
violated professional ethics, and 30% felt that it conflicted with 
their own religious beliefs.'"'' 

Since the legalization of physician-assisted suicide is an area 
where opinion is sharply divided, research is needed to understand 
the basis of physicians’ beliefs about it. This study was designed to 
examine the extent and correlates of physicians’ endorsements of 
the legalization of assisted suicide with regard to their specialties, 
sex, and opinions about certain other contemporary issues in the 
U.S. health care system. 

Method 

Graduates of Jefferson Medical College from the classes of 1987- 
1992 (N = 1,271) who were practicing medicine in the United 
States comprised the study population. Based on a search of rele- 
vant literature and two pilot studies," a survey was developed that 
consisted of 33 items to be answered on a five-point Likert scale 
(“strongly agree” = 5, to “strongly disagree” = 1). The survey ad- 
dressed five aspects of changes in the U.S. health care system in- 
fluencing medical education, quality of care, patient referral, cost 
of care, ethical issues, and sociopolitical matters" (copies of the 
survey are available from the authors). The item reading “Physi- 
cian-assisted suicide should be legalized” was used as the dependent 
variable in the present study. 

The questionnaires were mailed in May 1998, followed by three 
reminders mailed to non-respondents at three-week intervals. Use- 
able forms were returned by 835 physicians (66% response rate), of 
whom 830 responded to the item on the legalization of physician- 
assisted suicide. The respondents included 578 (69%) men and 257 
(31%) women, with a mean age of 35.8 years. The specialties of 
respondents were distributed as follows: family practice, 116 (14%); 
general internal medicine, 85 (10%); pediatrics, 38 (5%); emer- 
gency medicine, 49 (6%); obstetrics-gynecology, 34 (4%); surgery 
and surgical subspecialties, 47 (6%); psychiatry, 28 (3%); hospital- 
based specialties (anesthesiology, pathology, and radiology), 97 
(12%); medical subspecialties, 86 (10%); and other specialties and 
subspecialties, 255 (30%). Statistical analysis included bivariate 
and multivariate correlation, t test, chi-square, and z test for pro- 
portions. 
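
As a rough illustration of the z test for proportions listed among the analyses, the sketch below compares two independent proportions using a pooled standard error. The function names and the choice of the pooled variant are assumptions, since the article does not say which form of the test was used. The example inputs are the rounded group sizes and percentages given in the Results for Asian and white physicians, so the statistic will not match the published value exactly.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """Pooled two-sample z statistic for the difference p1 - p2 (illustrative helper)."""
    # Pooled proportion under the null hypothesis that both groups share one rate.
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    # Standard error of the difference between the two sample proportions.
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


def two_sided_p(z):
    """Two-sided p value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Rounded values from the Results: 63% of 48 versus 43% of 557.
z = two_proportion_z(0.63, 48, 0.43, 557)
print(f"z = {z:.2f}, two-sided p = {two_sided_p(z):.4f}")
```

Because the inputs are rounded percentages rather than raw counts, the result should be read only as a plausibility check on the reported significance level.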

Results 

No significant difference was found between respondents and non- 
respondents with respect to gender (31% versus 33% women, re- 
spectively), age (35.8 versus 35.9 years), full-time salaried faculty 
appointment (14% versus 12%), and primary care practice (which 
was defined as family medicine, general internal medicine, and gen- 
eral pediatrics) (29% versus 34%). 

Similarly, no difference was found for academic performance 
measures such as scores on the United States Medical Licensing 
Examinations, Steps 1-3, and clinical competence ratings provided 
by residency program supervisors at the end of the first postgraduate 
training year in three competence areas of “data-gathering and 
processing skills,” “interpersonal skills and attitudes,” and “socioeco- 
nomic aspects of patient care.”'‘'^ 

Respondents’ Endorsement of Legalization of Physician-assisted Sui- 
cide. Of the 830 respondents, 284 (34%) endorsed legalization (73 
[9%] “strongly agreed” and 211 [25%] “agreed”), 340 (41%) opposed 
it (189 [23%] “disagreed” and 151 [18%] “strongly disagreed”), and 
206 (25%) expressed “no opinion.” The response patterns were sim- 
ilar for physicians who graduated in the six different cohorts. 
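
The response distribution above can be checked with a few lines of arithmetic. The sketch below uses only the counts reported in this paragraph; the variable and function names are illustrative.

```python
# Counts reported for the item "Physician-assisted suicide should be legalized".
counts = {
    "strongly agree": 73,
    "agree": 211,
    "no opinion": 206,
    "disagree": 189,
    "strongly disagree": 151,
}

total = sum(counts.values())
endorsed = counts["strongly agree"] + counts["agree"]
opposed = counts["disagree"] + counts["strongly disagree"]
undecided = counts["no opinion"]


def pct(n):
    """Percentage of all respondents to the item, rounded to a whole number."""
    return round(100 * n / total)


print(f"total respondents: {total}")
print(f"endorsed: {endorsed} ({pct(endorsed)}%)")
print(f"opposed: {opposed} ({pct(opposed)}%)")
print(f"no opinion: {undecided} ({pct(undecided)}%)")
```

The three groups account for all 830 respondents, and the rounded percentages agree with the 34%, 41%, and 25% given in the text.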

Correlates of Endorsement of Legalization of Assisted Suicide. The 
endorsement rates for legalization of physician-assisted suicide were 
examined by the following variables: 

■ Demographics. Endorsement of legalization was unrelated to age 
and gender. Although the small number of African-American 
and Hispanic physicians in the sample was insufficient for mean- 
ingful statistical analysis, Asian physicians (n = 48) were signif- 
icantly more likely than were whites (n = 557) to endorse 
legalization (63% versus 43%; z test for proportions = 2.85, p < .01). 

■ Specialty. Orthopedic surgeons endorsed assisted suicide at the 
highest rate, which was 52%, followed by psychiatrists (41%), 
and physicians in the hospital-based specialties (40%). The low- 
est rates were for medical subspecialists (25%), general internists 
(28%), emergency medicine physicians (31%), family physicians 
(33%), and general pediatricians (34%). These differences in 
attitudes toward legalization among specialties were statistically 
significant (χ² = 33.7, p < .05). 

■ Postgraduate ratings of clinical competence. The physicians who en- 
dorsed legalization had been rated significantly lower by their 
Table 1. Bivariate and Multiple Correlations and Regression Coefficients Predicting 830 Physicians’ Endorsements 
of Physician-assisted Suicide, Jefferson Medical College* 

                                                                     Bivariate    Regression 
Predictor†                                                               r        Coefficient 

Physicians should unionize to maintain the influence of their 
  profession.                                                          .17§          .12‡ 
The present paradigm of medical education does not take into 
  account the psychosocial factors related to illness.                 .15§          .15§ 
Government should be responsible for regulating policies that 
  influence the quality of care.                                       .12§          .12‡ 
Learning to work in a changing health care environment should 
  become an essential part of medical education.                       .11§          .14§ 
Physicians involved in HMOs or other types of managed care 
  order fewer tests than those in private practice.                    .11§          .12‡ 
The future of health care should be based on the needs of 
  society not on the satisfaction of physicians.                      -.11§         -.09§ 
Physicians involved in managed care have the same dedication 
  to their patients as physicians in fee-for-service.                 -.08§         -.08‡ 

Intercept                                                                            2.3§ 
Multiple R                                                                           .30§ 

*Participants were 830 physicians who graduated from Jefferson Medical College between 1987 and 1992. 
†Items on 33-item survey that correlated either positively or negatively with respondents’ endorsement of physician-assisted suicide. 
‡p < .05; §p < .01. 



residency program directors in the postgraduate clinical compe- 
tence areas of “interpersonal skills and attitudes” (F = 6.25, 
p < .01), and “socioeconomic aspects of patient care” (F = 
6.94, p < .01). No significant difference was noted in the area 
of “data gathering and processing skills.” 

■ Other significant predictors of endorsement of legalization. Bivariate 
correlations between responses to the item on legalization and 
those for the other 32 items in the survey were examined. Nine items 
had statistically significant correlations with endorsement of le- 
galization. A stepwise multiple regression algorithm was used, in 
which numerical weights assigned to responses to the item on 
legalization were considered as the dependent variable (criterion 
measure) and numerical weights assigned to the nine items of 
the survey that had significant correlations with responses on the 
physician-assisted suicide item were the independent variables 
(predictors). Only seven items contributed significantly (p < .05) 
to the multiple regression model, which is summarized in Table 
1. Five contributed positively in that endorsement of legalization 
of physician-assisted suicide was associated with agreement with 
those items. Two contributed negatively, meaning that endorse- 
ment of legalization was associated with disagreement with those 
items. 

As reported in Table 1, those who endorsed legalization were 
more likely to agree that physicians should unionize (r = .17, p < 
.01), that the present paradigm of medical education does not take 
into account the psychosocial factors related to illness (r = .15, p 
< .01), that government should take responsibility to regulate 
health care policies (r = .12, p < .01), that learning to work in a 
changing health care environment should become an essential part 
of medical education (r = .11, p < .01), and that physicians who 
work with managed care organizations order fewer tests than their 
counterparts in private practice (r = .11, p < .01). 

Conversely, the physicians who endorsed legalization were more 
likely to disagree that the future of health care should be based on 
the needs of society rather than on physicians’ satisfaction (r = 
-.11, p < .01) and that physicians in HMOs as compared with 
those in other settings have similar dedication to their patients (r 
= -.08, p < .05). The multivariate correlation was .30, p < .01 
(see Table 1). 

It is noteworthy that the responses to the item on legalization 
were not correlated with several other items, including the consid- 
eration of cost as an important factor in patient care decisions, 
physicians’ support for the efforts of government to ration care, and 
the role that organized medicine should take with respect to social 
issues that can influence the well-being of society. 
Discussion, Conclusions, and Implications 

The findings of the present study support prior research showing 
that physicians hold widely disparate views regarding the legaliza- 
tion of physician-assisted suicide. More physicians in our study were 
opposed to legalization (41%) than supported it (34%), and a sig- 
nificant fraction of these physicians (25%) had not formed an opin- 
ion. The proportion of physicians in our study favoring legalization 
was similar to those in other survey work in this area.' Almost all 
respondents endorsed medical school preparation for, and subse- 
quent provision of, compassionate care at the end of life (92%), 
suggesting that the differences of opinion related only to the con- 
troversial area of assisted suicide and not to caring for the dying 
patient in general. 

Our study found that physicians in the people-oriented special- 
ties most associated with direct and ongoing patient contact that 
included treatment of dying patients (general medicine, family 
medicine, and medical subspecialties) were less likely to endorse 
legalization than were technology-oriented physicians, including 
hospital-based specialists and orthopedic surgeons. Experience with 
the first year of legalized physician-assisted suicide in Oregon ac- 
knowledges the great emotional toll on physicians directly involved 
in its implementation.''* The emotional burden and the acknowl- 
edged complexities in caring for dying patients may make physi- 
cians involved in this process more reluctant to endorse legaliza- 
tion. An interesting corollary suggested by our findings is that 
physicians endorsing legalization were less comfortable with their 
medical school training in the psychosocial aspects of care and were 
rated poorer in the areas of interpersonal skills and attitudes and 
in socioeconomic aspects of patient care in the first year of resi- 
dency. 

It is not known to what degree opinions about legalization are 
subject to modification by educational experiences during medical 
school. A recent study that examined medical students’ views on 
physician-assisted suicide found that fourth-year medical students 
in Oregon were less likely than were fourth-year medical students 
in other areas of the country to be willing to provide a patient with 
a lethal prescription.'^ The authors suggested that a change in will- 
ingness to comply with legalized physician-assisted suicide might 
have occurred as a result of experience with such requests from 
dying patients. 

Unlike many areas of medical education where knowledge is 
largely dependent on didactic teaching, care of the dying and at- 
titudes towards assisted suicide are likely to be influenced primarily 
by personal experiences as well as the moral, ethical, and political 
tenets that adults bring to medical training. In addition to explor- 
ing more closely the relationship between these personal beliefs and 
attitudes, an important priority for research is to determine whether 
attitudes towards care of the dying and physician-assisted suicide 
could be modified by education. Evaluation of the impact of edu- 
cational experiences such as structured exposure to palliative care 
or rotations in a hospice service for medical students and residents 
would help to answer these questions. As the U.S. health care 
system moves from theory to practice regarding physician-assisted 
suicide, more research is needed to explore further the impact of 
legalization on physicians and their patients. 

The advantages of this survey include the large sample size, gen- 
der composition, and specialty and geographic distribution of the 
participants that represent a broad spectrum of the population of 
physicians. Despite these advantages, one limitation of our study is 
that it ascertained physicians’ views of the legalization of assisted 
suicide rather than their views of its practice. However, the two 
concepts seem logically related. The primary purpose of the survey 
was to gather views of multiple issues in the current health care 
system, including attitudes toward legalization of assisted suicide. 
Another limitation is that the results of this study of young phy- 
sicians who graduated from a single private medical school in the 
Northeast may not be fully generalizable to all U.S. physicians. 
However, the distribution of reactions is similar to that reported in 
the literature.' 

As physicians hold an influential position in the public debate 
on the legalization and practice of physician-assisted suicide, it is 
important to further understand the bases for their strong and dis- 
parate views. Further research in this area should elucidate the po- 
litical, moral, and ethical frameworks that physicians bring to this 
topic. Specifically, it is essential to understand the degree to which 
physicians’ views on the legalization of physician-assisted suicide 
are subject to modification by medical education in general, and 
by experiences with dying patients in particular. 

Development of the foundation for this study was supported, in part, by a grant from 
the Bureau of Health Professions, Health Resources and Services Administration, 
USDHHS, under Cooperative Agreement 1 U76 MB00002-03, Centers for Medical 
Education Research and Policy. 

Correspondence: Karen Novielli, MD, Department of Family Medicine, Jefferson Med- 
ical College, 401 Curtis, Philadelphia, PA 19107; e-mail: (karen.novielli@mail.tju.edu). 
Learning Adolescent Psychosocial Interviewing Using Simulated Patients 



KIM BLAKE, KAREN V. MANN, DAVID M, 



The area of communication skills in adolescent medicine is emerg- 
ing as a distinct and important part of the undergraduate curricu- 
lum. An appropriate level of confidence in dealing with the ado- 
lescent population is deemed a necessary educational requirement.* 
Skills in psychosocial communication with adolescents differ from 
those required for younger patients and adults'"'"; they include dis- 
cussing confidentiality and adolescent risk-taking activities. Simu- 
lated patients can be used effectively in teaching and evaluating 
communication skills.^"^ However, there is no report of using ad- 
olescent simulated patients to teach communication skills. 

The evidence available is inconclusive regarding the teaching 
time required to promote retention of communication skills, al- 
though a recent review^ suggests that one day’s training or less is 
not effective. Long-term retention of these skills has been supported 
by only one paper,® suggesting a need to follow students over time 
to ascertain the effect of communication skills training. 

Our study addressed two questions: (1) does feedback from a 
simulated adolescent patient and simulated mother lead to im- 
provements in fourth-year medical students’ psychosocial interview- 
ing of adolescent patients? and (2) does this skill persist following 
the intervention? 

Method 

Final-year medical students (N = 68) from March 1998 through 
May 1999 were invited to participate, and 57 agreed. The 11 who 
were unavailable to participate were either interviewing for their 
postgraduate education, involved in presenting their own research, 
or unable to make the scheduled times for the simulations. Thirty- 
five other class members were either randomly or self-selected to 
go to offsite locations for pediatrics, and therefore could not par- 
ticipate; however, this group acted as a non-randomised control arm 
to the study. A two-group (57 students in the intervention group 
and 35 in the control) prospective randomized double-blind study 
design was employed. The students were completing an eight-week 
core pediatrics rotation in a tertiary center, with seven to nine 
students per rotation. 

Study Question 1 

Intervention. Four simulated cases were developed, each com- 
prising both a medical component (epilepsy, diabetes, attention def- 
icit disorder, or asthma) and risk-taking activities (smoking, drugs, 
boyfriend issues) in which the adolescent was scripted to be in- 
volved. Nine simulated mothers and ten female adolescents (mean 
age 13.6 years) were recruited using established procedures.*^ 
Mother-and-daughter pairs were selected as this is the commonest 
adolescent presentation in medical practice. Young adolescents 
were chosen to provide a realistic presentation of this age group, 
which often presents a challenge to young doctors. The training 
for standardized feedback was achieved when all mothers reviewed 
a single taped scenario, scored this independently using a structured 
form, and then discussed the feedback they would provide the stu- 
dent in a group setting. The adolescents were guided by their part- 
ner mothers to give feedback, which the adolescents discussed in a 
focus group. 

At study entry, all students signed informed consent forms. They 
then interviewed a simulated mother-daughter pair. The students 
were randomly assigned to receive immediate feedback following 
the pre-test interview from the simulated pair (F+), or to receive no 
feedback (F−). All students conducted a second interview four 
weeks later using a different case scenario. All students (F+ and 
F−) received feedback from the simulated pair following this post-test 
interview. Feedback was structured using a written modified Cal- 
gary-Cambridge guide and given verbally; both interview content 
and process were addressed. 

Measures. Three measures were taken: 

1. Questionnaire. At study entry, demographic data and stu- 
dents’ self-ratings of prior experiences with adolescent medicine, 
confidence in dealing with adolescent patients, and anticipated fu- 
ture work with adolescents were collected. 

2. Pre-test. Students conducted a one-hour videotaped interview 
with a simulated adolescent and mother, using one of the four case 
scenarios, at the midpoint of their rotation. The videotaped inter- 
views were scored by a psychologist who had been trained to reach 
an acceptable level of agreement with the principal investigator 
(KB) using the modified Calgary-Cambridge guide. 

3. Post-test. Four weeks later, each student conducted a second 
videotaped interview, using a different case scenario. Scoring was 
completed in the same manner as for the pre-test. 

Study Question 2 

Intervention. The entire final-year class participated in a man- 
datory ten-station OSCE prior to graduation. This was two to 12 
months after participation in the study (mean 6.6 months). One 
pediatrics station of this OSCE tested general pediatrics knowledge 
(students’ performances in asking about medical aspects of the case) 
and adolescent psychosocial interviewing (students’ performances 
in asking about psychosocial aspects, e.g., boyfriend, alcohol, 
drugs). The OSCE included 35 off-site students, those not involved 
in the adolescent interviewing study, i.e., those who had not been 
videotaped and had received no feedback (F^) and 45 of the 57 
students who had completed their pediatrics rotation at the tertiary 
center and who had participated in the study (F‘ and F^). 

Measures. The knowledge score and the psychosocial interview- 
ing score on the pediatrics OSCE station were obtained from the 
checklists completed by the faculty examiner at the station. 

Data Analysis 

Study Question 1. A single psychologist, blinded to student group 
or time of interview, scored the tapes, using a modified Calgary- 
Cambridge Observation Guide. The psychologist evaluated eight 
aspects of the encounter: how the student initiated the session, 
collected information, gathered information, asked the parent for 
time alone with the patient, dealt with the adolescent alone, and 
acted before and during the examination and closure. Each section 
yielded a global score. Within seven of the sections there were 
between three and ten individual items. The section used to rate 
when the student was alone with the adolescent included 14 psy- 
chosocial elements (i.e., boyfriend issues, smoking, and drugs). 

The psychologist derived eight global ratings for each videotape. 
The global ratings for F¹ and F² students at pre- and post-test were 
compared using a paired t test. Regression analysis was conducted 
using student global ratings from the eight sections of the modified 
Calgary-Cambridge Observation Guide as the dependent (outcome)
variable. The independent (predictor) variables were feedback,
case type and simulator, gender, previous medical experience 
with adolescents, comfort level in relating to adolescents, future 
career plans, and the students’ scores on the pre-test case. 
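
As an illustration of the paired comparison described above, the following is a minimal sketch of a paired t statistic computed from matched pre-test and post-test scores. The function name `paired_t` and the five pairs of scores are hypothetical and are not study data; the sketch shows only the arithmetic of the test, not the authors' actual analysis.

```python
import math
from statistics import mean, stdev


def paired_t(pre, post):
    """Return (t, degrees of freedom) for a paired t test on matched scores."""
    if len(pre) != len(post) or len(pre) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    # Work on the within-student differences (post minus pre).
    diffs = [after - before for before, after in zip(pre, post)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical global ratings for five students (illustration only).
pre_scores = [70.0, 65.0, 80.0, 72.0, 68.0]
post_scores = [78.0, 70.0, 85.0, 75.0, 77.0]
t_value, df = paired_t(pre_scores, post_scores)
print(f"t = {t_value:.3f}, df = {df}")
```

The resulting t value would then be compared against the t distribution with the stated degrees of freedom to obtain a p value.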

Study Question 2. The knowledge score and the psychosocial interviewing
score on the pediatrics OSCE station were compared 
among the three groups (F⁰, F¹, F²). 

Results 

Complete data were available for 52 of the 57 students (F¹ = 31; 
F² = 21) who completed both pre- and post-test interviews. Two 
tapes could not be rated, and three students did not complete the 
second interview. 

Study Question 1. The mean pre-test scores of the group receiving
feedback after their first interview (F²) and those receiving no 
initial feedback (F¹) were not statistically different (72.93, SD = 
9.43 versus 72.77, SD = 8.08; p = 0.95). However, the group that 
received feedback immediately after their first interviews (F²) 
scored significantly higher on the post-test (82.81, SD = 9.79) 
than did the F¹ group (76.34, SD = 9.43; p = 0.02). No significant 
improvement was seen from pre-test to post-test for the group receiving
no initial feedback (F¹). However, the group receiving feedback
(F²) improved significantly from pre-test to post-test (p = 
0.02). 

Regression analysis revealed that receiving feedback was the only 
significant predictor (p = .021) of students’ performances on the 
post-test case (R² = .10 for the complete model). The other independent
variables did not significantly predict post-test performance.
Analyses also were conducted to determine whether or not 
the particular case scenario used had a significant influence on stu- 
dent performance. No statistically significant influence due to case 
difference emerged. 

Study Question 2. All students participating in the study received 
feedback either once (F¹) or twice (F²). Both groups (n = 45) had 
significantly higher mean scores (p = .023) on the adolescent psychosocial
inquiry on the final-year OSCE station (68.06, SD = 
24.07) compared with the students (n = 35) who completed their 
core pediatrics rotation at the off-site placements (F⁰) (55.71, SD = 
23.16). The groups did not differ significantly (p = 0.40) in their 
mean scores for the general knowledge aspects of this OSCE station 
(F¹ and F²: 70.71, SD = 16.88, compared with F⁰: 67.53, SD = 
16.69). 

After the OSCE the students were asked to comment on their 
clerkship experience. The simulated adolescent encounters were 
rated as one of the most positive learning experiences in the two 
years of clerkship. 

Discussion 

The main study finding is that the important communication skill 
of interviewing the adolescent patient can successfully be taught to 
undergraduate medical students. The teaching becomes faculty-independent
when the simulated patients are scripted and trained in 
giving structured feedback. The training period, which was a one- 
hour interview (experimental), followed by 20 minutes of feedback, 
was much less than one day, which is the time reported in the 
literature as necessary for effective learning of these skills. This 
study poses questions for further research regarding optimal training 
time and the best method of reinforcement. For psychosocial interviewing
with sensitive questioning, clerkship seems the optimal 
point of instruction; however, there is little evidence to inform 
where training in communication with adolescents should be 
placed in the medical curriculum. 

There are several limitations to this study. First, the sample was 
small, although representative of other randomized controlled trials 
in this field. Second, selection bias may have occurred, as the students
who chose to complete their core pediatrics rotations at off-site
placements were either randomly or self-selected. However, all 
students received the core pediatrics tutorials from the tertiary center
by teleconference, along with detailed objectives. This ensured 
that all students received the same didactic curriculum. Third, although
the study would have benefited from two independent raters,
the increased cost was prohibitive. The psychologist rater was 
trained to use the modified Calgary-Cambridge Guide’^ and underwent
a mid-study validation of his scoring. Fourth, our sample 
was confined to mothers and daughters; whether the results would 
differ with mother-son simulator pairs is unclear. Fifth, although 
this study provides some indication that students’ psychosocial 
communication skills can be improved and maintained over time, 
follow up was less than a year. Continued tracking of these doctors 
would be important to see whether this mastery is maintained into 
the residency years. Finally, application of these results must consider
resources. At our medical school, standardized patients frequently
supplement current teaching activities, and are part of the 
diagnostic assessment of student skills throughout the medical 
school curriculum. Expertise to train and administer such a program 
is quite involved from a logistic and monetary standpoint; although 
available at our medical school, this may not be the case everywhere.
As this educational initiative relies on a realistic portrayal 
and structured feedback from the adolescent, time spent in recruit- 
ment and training of the standardized patients is important. 

Students overwhelmingly commented that feedback from a “real” 
adolescent was very helpful, as they had received little training in 
this area. Many of the students were very apprehensive on entry 
into the study, but were resoundingly positive after they had completed
it. 

Because of the changing nature of the hospitalized patient pop- 
ulation, standardized patients could be used to ensure that each 
student has exposure to common ambulatory problems. They could 
help ensure uniformity in teaching and learning of basic clinical 
skills. Interviewing an adolescent standardized patient who is involved
in risk-taking activities provides the student an opportunity 
to practice psychosocial interviewing in a safe setting. The immediate
feedback provided by the adolescent and mother is a powerful 
teaching tool. The student can then return to the clinical setting 
to apply these newly acquired skills. 

In conclusion, this randomized controlled trial has shown that 
final-year medical students can be taught adolescent interviewing 
skills and that these skills are retained for as long as a year. The 
teaching time required for such an intervention is short (90 
minutes), and teaching can be independent of faculty once the 
simulators’ training is completed. As the skill of talking to adoles- 
cents and their parents is an important part of physician training, 
we would recommend that medical schools consider this structured 
training for their curricula. 

PLENARY: OUTSTANDING RESEARCH PAPERS 



Moderator: James O. Woolliscroft, MD 



Have Clinical Teaching Effectiveness Ratings Changed with the Medical College of Wisconsin’s 

Entry into the Health Care Marketplace? 

DAWN BRAGG, ROBERT TREAT, and DEBORAH E. SIMPSON 



Medical schools, as competitors in today’s health care marketplace, 
have the challenge of training future physicians while increasingly 
relying on clinical revenues.* Is teaching compatible with competitive
managed care in the future of health care? 

Skeff, Bowen, and Irby argue that teaching takes time and that 
its values must be re-emphasized as a core mission of medical 
schools.' Medical education researchers have reported diminishing 
amounts of time available for physicians’ educational responsibili- 
ties to both residents'* and medical students.^ Student evaluations 
reveal that there has been less time available for them in more 
recent years.** Thus, time impacts on education have been docu- 
mented, but the critical issue to be investigated is whether the 
quality of teaching has been compromised. 

As a large, private medical school, the Medical College of Wis- 
consin (MCW) has not escaped the grasp of today’s competitive 
health care environment. On December 31, 1995, the John L. 
Doyne Hospital (JLDH), formerly Milwaukee County General Hospital,
was closed. While this facility (a primary practice and clinical 
teaching site) was purchased by a private adult not-for-profit hospital,
its sale nonetheless serves as a major demarcation point in 
MCW’s transition into today’s health care marketplace. Indigent 
care was now provided on a competitive contract basis. Our faculty 
formed a clinical practice group to enhance their competitive po- 
sition in this evolving health care environment. Declining federal 
support for graduate medical education led to decreased positions 
in selected specialties and their associated support of medical student
education. While the multi-dimensional impact of these 
changes on medical education, at MCW and elsewhere, will take 
years to analyze,' preliminary analysis can reveal whether the quality
of clinical teaching has changed during this time period. This 
study, therefore, examined whether there have been changes in 
clinical teaching effectiveness ratings as clinicians at MCW com- 
pete for patients and revenue. 

Method 

The study utilized student ratings of clinical teachers from a lon- 
gitudinal clinical teaching database implemented in 1992. A stan- 
dard clinical teaching instrument® is used across participating clin- 
ical departments. The instrument contains 16 characteristics of 
effective clinical teaching, derived from a comprehensive review of 
the literature, rated using a five-point Likert scale (1 = most positive).
Items address faculty interaction with students (e.g., actively 
involved me with patients, provided timely, constructive feedback 
without belittling me), ability to communicate (e.g., clear, orga- 
nized, answered my questions clearly), and overall teaching effec- 
tiveness. The form is highly reliable, with a coefficient alpha of .96. 
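
For readers unfamiliar with coefficient alpha, the following is a minimal sketch of how the statistic can be computed from a respondents-by-items table of ratings. The function name `cronbach_alpha` and the small ratings table are hypothetical; they illustrate the standard formula and are not the instrument's data.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Coefficient alpha for a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    if k < 2 or any(len(r) != k for r in rows):
        raise ValueError("every respondent needs the same k >= 2 item scores")
    # Sum of the sample variances of the individual items.
    item_variance_sum = sum(variance(column) for column in zip(*rows))
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical ratings: four respondents, three items (illustration only).
ratings = [
    [1, 1, 2],
    [2, 2, 2],
    [3, 3, 4],
    [4, 5, 4],
]
alpha = cronbach_alpha(ratings)
print(f"coefficient alpha = {alpha:.3f}")
```

Values near 1 indicate that the items vary together, which is the sense in which a form is described as internally consistent.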

Since 1992, third-year medical students have evaluated 295 full- 
time clinical teachers in pediatrics, internal medicine, family med- 
icine, anesthesiology, and general surgery. For purposes of this study, 
the data were divided into three time periods, using 1995 as the 
benchmark date for MCW’s entry into the health care marketplace: 
before-entry, 1993-94; at-entry, 1995-96; and after-entry, 1997-98 
(numbers of evaluations per period = 1,327, 4,354, and 6,577,
respectively). 

A three-stage analytic process was used to determine whether 
students’ ratings of clinical teaching had changed during the study 
period. First, the 16 clinical teaching instrument items were clustered
to facilitate analysis using agglomerative hierarchical cluster 
analysis (HCA).'’ This method has been successfully used to cluster 
items on standardized tests into psychological dimensions.*’"' In 
HCA for an n-item test, there are n solutions or clusters. In the 
first step, each item comprises one cluster. At subsequent steps, the 
procedure combines two clusters from the previous step, based upon 
the proximity or similarity among each possible pairing of the clusters.
The smaller the proximity value, the more similar the two 
clusters are believed to be. The final cluster, the nth cluster, places 
all of the items into one cluster. By examining the two- or three-cluster
solution for interpretability, a researcher can get a nonparametric
perspective on groups of items that may be considered to 
be dimensionally distinct. Unlike factor analysis, cluster analysis is 
nonparametric and is a quick way to identify possible dimensions 
that may exist. In this study, selected clusters of clinical teach- 
ing skills were examined for internal consistency using coefficient 
alpha. 
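
The agglomerative procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the text does not specify the proximity measure or linkage rule, so Euclidean distance and average linkage are assumed here, and the item names and rating profiles are hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Proximity between two item profiles (smaller means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(profiles):
    """Return the n cluster solutions for n items, from n singletons to 1 cluster.

    profiles maps each item name to its vector of ratings.
    Assumes average linkage between clusters.
    """
    def linkage(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(euclidean(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)

    clusters = [[name] for name in profiles]      # step 1: each item is a cluster
    solutions = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest proximity value.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        solutions.append([list(c) for c in clusters])
    return solutions


# Hypothetical rating profiles for four items (illustration only).
item_profiles = {
    "supportive": [1, 2, 1, 2],
    "approachable": [1, 2, 2, 2],
    "clear": [3, 1, 3, 1],
    "organized": [3, 1, 3, 2],
}
solutions = agglomerate(item_profiles)
two_cluster_solution = solutions[len(item_profiles) - 2]
print(two_cluster_solution)
```

A k-cluster solution for n items is found at index n - k of the returned list, which is how a two- or three-cluster solution would be pulled out for inspection.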

Using these clusters, two-way analysis of variance was performed 
comparing the cluster means to determine whether (1) students’ 
ratings varied by time period; (2) students’ ratings varied by item 
cluster; and (3) there was an interaction effect between time periods
and clusters. Individual items that had been closely associated 
with the availability of teaching time in previous studies were then 
analyzed using a one-way analysis of variance to examine differ- 
ences in student ratings across the three time periods. 
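
To make the one-way analysis of variance concrete, the sketch below computes an F statistic from group sizes, means, and standard deviations alone. The input figures are the values printed in Table 1 for the item "supportive of me/had rapport with me"; treating the per-period evaluation counts as the item-level sample sizes is an assumption made here for illustration, and the function name `anova_from_summary` is hypothetical. This is not a reproduction of the authors' computation.

```python
def anova_from_summary(groups):
    """One-way ANOVA from summary statistics.

    groups is a list of (n, mean, sd) tuples, one per time period.
    Returns (F, df_between, df_within).
    """
    total_n = sum(n for n, _, _ in groups)
    k = len(groups)
    grand_mean = sum(n * m for n, m, _ in groups) / total_n
    # Between-groups sum of squares: spread of group means around the grand mean.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    # Within-groups sum of squares, recovered from each group's SD.
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between, df_within = k - 1, total_n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# (n, mean, SD) for before-entry, at-entry, and after-entry, as printed in Table 1.
supportive_item = [(1327, 1.44, 0.71), (4354, 1.69, 0.90), (6577, 1.67, 0.88)]
f_stat, df_b, df_w = anova_from_summary(supportive_item)
print(f"F({df_b}, {df_w}) = {f_stat:.1f}")
```

With samples this large, even small differences in mean ratings produce a large F, which is worth keeping in mind when reading the significance levels reported for the item comparisons.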

Results 

A three-cluster solution resulting from the HCA was selected for 
statistical and substantive reasons and to increase comparability of 
results with findings from prior factor-analytic studies. Ullian et 
al.,'* in their synthesis of factor-analytic studies, reported that while 
there are varying numbers of factors, most studies suggest four solutions.
The three-cluster solution was selected for this study as the 
two-cluster solution contained many items that did not seem to fit 
qualitatively and other cluster solutions contained at least one 
group with fewer than four items, posing a threat to internal con- 
sistency. The three clusters were examined qualitatively to assess 
content validity and their relationship to Ullian’s four factors. 

The first cluster of clinical teaching skills was labeled supervisor/person
and contained seven items: supportive of me/had rapport 
with me, approachable/available, actively involved me with patients,
communicated expectations, demonstrated skills/procedures 
to be learned, provided opportunities to practice diagnostic/assessment
skills, and provided feedback without belittling me. The second
cluster was labeled physician/teacher and contained five items: 
answered questions clearly, asked questions clearly, explained basis 
for decisions/actions, clear/organized, and clinically competent/knowledgeable.
The third group, containing four items, was labeled 
instructor/leader: took advantage of teaching opportunities, enthusiastic/stimulating,
responded to student-initiated learning issues, 
and emphasized comprehension rather than factual recall. All three 
item clusters, supervisor/person, physician/teacher, and instructor/leader,
were found to be highly reliable (coefficient alpha = .90, 
.86, .80, respectively). According to Ullian et al., these three clusters
define the roles that clinical teachers assume in their interactions
with students. 

The students’ ratings ranged from a minimum of 1 (most posi- 
tive) to a maximum of 5 (least positive). Mean ratings across the 
three time periods were found to differ significantly (p < .001) (see 
Table 1). Post-hoc comparisons (i.e., Tukey test) revealed that the 
mean ratings for the periods were significantly different (all comparisons
p < .001). Mean student ratings for the three clusters were 
also significantly different (p < .001). Throughout the before-entry, 
at-entry, and after-entry periods, physician/teacher skills were rated 
best by third-year students, while supervisor/person skills received 
the worst ratings (see Table 1). The analysis also showed an inter- 
action between the time periods and the three groups (p < .001). 

Mean student ratings for the three sets of skills started out positively
in the first, before-entry year (see Figure 1). This was due 
to the fact that in 1993 faculty began to receive the first results of 
their clinical teaching evaluations. As reported in a prior study, 
when faculty receive clinical teaching evaluation results, their clinical
teaching ratings improve as they immediately seek to address 
deficits.' “ Mean ratings for supervisor/person and instructor/leader 
skills increased (became worse) sharply in the second year. Mean 
ratings for physician/teacher continued to improve throughout the 
before-entry years. During the at-entry period, mean ratings for supervisor/person
and instructor/leader skills continued to increase 
(becoming worse), but the ratings increased only gradually for physician/teacher.
Supervisor/person skills peaked in 1996, the year the 
faculty practice plan was implemented. Mean ratings for instructor/leader
and physician/teacher leveled off between 1995 and 1996. 
The after-entry period saw improved ratings for the three item clusters.
However, none of the cluster ratings returned to the before-entry
baseline level. 

Of particular importance were the significant differences across 
time periods among the mean ratings of those characteristics associated
with the availability of time. For example, mean ratings 
of items within the supervisor/person cluster (e.g., supportive of me,
approachable/available, actively involved me with patients) followed 
the increased cluster ratings. However, the item “provided 
timely, constructive feedback without belittling me” received increasingly
poor ratings across the three time periods. Analysis indicated
that all four questions within this cluster were significantly 
different across the time periods (p < .005). 

Table 1. Third-year Medical Students’ Ratings of Physicians’ Clinical 
Teaching Skills before, at, and after the Medical College of Wisconsin’s 
Entry into the Health Care Marketplace, 1992-1998 

                                              Rating*
                                 Before Entry    At Entry        After Entry
                                 (n = 1,327)†    (n = 4,354)†    (n = 6,577)†
                                 Mean (SD)       Mean (SD)       Mean (SD)
Skills clusters
  Supervisor/person              1.53 (.57)      1.79 (.72)      1.76 (.70)
  Physician/teacher              1.43 (.49)      1.49 (.52)      1.47 (.54)
  Instructor/leader              1.48 (.54)      1.60 (.65)      1.55 (.64)
Individual items
  Supportive of me/had
    rapport with me              1.44 (.71)      1.69 (.90)      1.67 (.88)
  Actively involved me
    with patients                1.37 (.67)      1.75 (.89)      1.67 (.87)
  Approachable/available         1.40 (.71)      1.59 (.84)      1.57 (.86)
  Provided timely constructive
    feedback without
    belittling me                1.58 (.78)      1.76 (.92)      1.80 (.93)

*Scale: 1 = most positive to 5 = least positive. 
†n = number of evaluations. 

[Figure 1: line graph; x-axis: Years; series: Supervisor/Person, Instructor/Leader, Physician/Teacher]

Figure 1. Students’ mean ratings of physicians’ clinical teaching skills across the 
before-entry (1993-94), at-entry (1995-96), and after-entry (1997-98) time 
periods. 

Discussion 

Longitudinal analysis of a clinical teaching evaluation data set reveals
that the overall effectiveness of our clinical teaching decreased
from a before-entry high at the time of entry into the health 
care marketplace. Over the at-entry study period, evaluations did 
gradually improve, but did not return to the before-entry baseline 
rate. However, not all item ratings were equally affected, with physician/teacher
skills (e.g., clear/organized, clinically competent) 
showing the least change and supervisor/person skills (e.g., approachable,
available, supportive of me, actively involved me with 
patients, provided timely, constructive feedback without belittling 
me) showing the largest decline. The supervisor/person skills, containing
the interpersonal items, appear to have been the most profoundly
affected by the entry into the health care marketplace. 

Although it may be possible that students become more discrim- 
inating in their assessments of teaching and teachers over time, this 
study does not report ratings by the same students over time. This 
study used ratings by individual third-year classes for six years. In 
addition, student ratings were averaged over two years for each time 
period, thus minimizing huge class differences. 

HCFA guidelines, increased pressures for clinical productivity, 
and accountability for cost-effective patient care have led physi- 
cians to repeatedly report that they have less time for clinical teach- 
ing. The results of this study suggest that there has also been a 
change in the quality of clinical teaching, as measured by the clinical
teaching effectiveness ratings over this critical time period, a 
relationship requiring further study to determine causality. While 
it is promising that the rating results do appear to have improved 
following an initial decline during the at-entry period, the fact that 
these ratings did not return to baseline levels is distressing. 

Supervisor/person skills are critical components of the teaching/learning
process, as education is enhanced when there is a supportive
relationship between the learner and the teacher." Medical 
schools must prepare clinical educators with teaching skills that are 
effective and efficient in today’s time-pressured clinical environments
and implement real reward structures that recognize the 
value of time spent in clinical teaching if we are to maintain the 
quality of our clinical education. 

Six-year Documentation of the Association between Excellent Clinical Teaching and Improved 

Students’ Examination Performances 

CHARLES H. GRIFFITH III, JOHN C. GEORGESEN, and JOHN F. WILSON 



With increasing fiscal pressures on academic medical centers, many 
institutions are moving towards mission-based financing, the notion 
that the clinical, research, and teaching missions must no longer 
depend upon cross-subsidization but must financially support themselves.*
With this increased mission-specific accountability, there 
will be greater emphasis on measurable outcomes to justify the costs 
associated with the mission. In the realm of clinical teaching, the 
literature is replete with studies of qualities of excellent teachers," 
studies of how to measure teaching,’ and studies demonstrating that 
faculty development in teaching can influence clinical teachers’ 
self-reported behaviors,’’ actual behaviors,' and teaching ratings.*' 
However, for the most part, the fundamental outcome of teaching 
has been left unstudied: that is, does the quality of teaching actually 
influence student learning? Although this may seem a truism too 
obvious for investigation, despite the cherished belief of clinical 
teachers there is very little quantitative evidence that better teaching
is associated with enhanced student learning. 

We recently reported the first documentation of the association 
of students’ learning with the relative teaching abilities of attending 
physicians^*^ and residents.’ In these studies of students and their 
clinical teachers over the academic years 1993-1995, we found that 
medical students who worked on their internal medicine or surgical 
clinical clerkships with our best clinical teachers scored significantly
higher on post-clerkship examinations and even on the U.S. 
Medical Licensing Examination (USMLE) Step 2. Our findings 
have been replicated at the University of Michigan.**' The only 
other study noting an association of teaching with learning, pub- 
lished in 1983, involved high school students in a remedial math 
class.** To our knowledge, this is the extent of the quantitative 
evidence in all the educational literature that better teaching is 
associated with better learning. 

Our previous reports, however, had several limitations. For one, 
our measure of teaching “quality” was based only on students’ ratings.
One can argue (as we did in those articles) that the learners 
are the best judges of the learning climate. Even though we controlled
in our analysis for prior student academic achievement 
(USMLE Step 1 scores), it was possible that students especially 
excited about internal medicine scored better on internal medicine 
examinations and, in their enthusiasm, rated their instructors 
higher, with a spurious association of examination performance and 
teaching rating. Second, though statistically significant, our effect 
sizes were modest, amounting to one-sixth to one-seventh of a standard
deviation on a test, or, for example, three points on the 
USMLE Step 2. Third, these studies encompassed only two academic
years, and a limited number of teachers and students. Because
this sample was small, we included in the analysis all teachers 
regardless of the numbers of students they worked with, even those 
with few teaching ratings. Though we were gratified to demonstrate 
an association between teaching and learning, our results may have 
been attenuated by the small sample and the inclusion in the 
analysis of all teachers, regardless of numbers of teaching evaluations
(teachers with imprecise measures of their teaching ability). 

Therefore, the purpose of this project was to refine the method 
of our previous studies by using a larger sample of students and 
attending physicians, more precise measures of teaching ability, 
and a way of disentangling the potential confounders of raters and 
teachers. Our formal hypothesis was that students who are exposed 
to our highest-rated attending physicians during their internal medicine
clerkship will score better on end-of-clerkship examinations 
and on the USMLE Step 2. 

Method 

This work represents an extension of the data set from our previous 
reports,*^’ extending the sample size from two academic years to six. 
The study design, a prospective cohort study, involves data on students
and their attending physicians, and notes the association of 
the students’ examination performances with the “quality” of the 
attending physicians to whom they were exposed. The participants 
were all third-year medical students at the University of Kentucky 
College of Medicine, over the academic years 1993-1999 and their 
attending physicians on the inpatient general medicine services. 

To give the reader a sense of the structure of our clerkship, stu- 
dents in the third year spend eight weeks on general medicine 
inpatient services, four at our university hospital, and four at our 
affiliated Veterans Affairs hospital. A team consists of an attending 
physician, a supervising junior or senior medicine resident, two 
first-year residents, and two students. Importantly, students, housestaff,
and attendings are randomly and independently assigned to 
the services (we do not take requests by students for specific attendings).
Attending physicians may be either general internists or 
specialists. Note that students are exposed to new and different 
attending physicians and housestaff in each of the two four-week 
components of the clerkship. Ambulatory medicine is part of a 
separate primary care clerkship and is not included in the study. 
Attending physicians usually participate in or observe daily management
rounds, and have formal separate teaching rounds three 
times per week, ideally focused on one or two patients on the service,
usually at the bedside. 

Our model for how teaching might influence students’ learning 
was not that students would be influenced by the average teaching 
ability of all the instructors they worked with, but rather that students’
learning would be enhanced by individual outstanding instructors
who, in the learning climate they engender and the inspiration
they provide each day, stimulate students to be excited 
about clinical medicine, resulting in students’ learning not only 
throughout the clerkship but throughout all their clinical rotations. 
Therefore, we explored the associations of students’ learning with 
exposures to particularly outstanding (or poor) attending physicians,
rather than with the average ability of their two attending 
physicians. 

Table 1. Least-square Mean Results on the NBME Subject Exam in Medicine and USMLE Step 2 for 484 Students Working with Internal Medicine 
Attending Physicians Rated on Their Teaching as High, Neither High Nor Low, and Low, University of Kentucky College of Medicine, 1993-1999* 

                                             NBME Subject Exam in         Total USMLE Step 2
                                             Medicine Score (R² = 0.44)   Score (R² = 0.57)
Attending Physicians’   No. of Students      Score (SD)    p (between     Score (SD)    p (between
Ratings†                Who Worked with                    Scores)                      Scores)
                        an Attending of
                        this Level
High                    219                  [illegible]   .007           207 (23)      .015
Neither high nor low    220                  [illegible]   -              203 (22)      -
Low                     45                   464 (90)      -              199 (22)      -

*Least-square mean results, which represent the predicted score for a student on the test, are controlled for USMLE Step 1 score. 
†High- and low-rated attending physicians were those so rated by consensus of a panel of 15 residents who had formerly been students at the University of Kentucky College of 
Medicine. 
‡Eighteen students who had worked with both a high-rated and a low-rated attending were excluded. 

In our prior studies,*'’ we simply defined “best” and “worst” attending
physicians as those with the highest and lowest teaching 
evaluations, as rated by students. However, as mentioned in our 
introduction, this could lead to a confounding of teaching rating 
with examination performance by a student who may perform better
(and rate the attending physician higher) because of interest in 
internal medicine. Therefore, for this study, we elected to pursue 
an alternative method of identifying teaching quality. We surveyed 
a consensus panel of third- and fourth-year residents at our institution
who had also been medical students here. These individuals 
would have had five to six years of exposure to the clinical teachers 
at our university, working with a great majority of them. We also 
chose former students who were residents because they would be 
most familiar with the special needs and expectations of our internal
medicine clerkship. We gave these residents a list of all faculty 
in internal medicine who had supervised more than five medical 
students during the six-year period. The threshold of five students 
was chosen because this was the number of evaluations we calculated
were needed to achieve conventional standards of reliability 
for our clerkship’s teaching evaluation form (greater than 0.80), 
and it would help identify those attending physicians for whom we 
had precise measures of their teaching ability. We asked the residents
to confidentially rate faculty “high” if they would expect 
them to be rated among our best teachers, “low” if they were among 
our worst teachers, and “medium” if they would be in between. A 
priori, we defined “best” attending physicians as those that were 
named high-rated instructors by 80% of the residents (at least 12 
of the 15 residents) and were not mentioned as a low-rated instructor
by any resident. Conversely, realizing the tendency for learners 
to rate even the worst instructors at least mediocre, we defined the 
“worst” attending physicians as those who were rated in the low 
category by at least five of the 15 residents, and who were not rated 
high by any of the residents. 
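
The a priori consensus rule can be written out as a short classification function. This is a sketch of the thresholds as stated in the text (at least 12 of 15 "high" votes and no "low" votes for "best"; at least five "low" votes and no "high" votes for "worst"); the function name, the vote encoding, and the example ballots are hypothetical.

```python
def classify_attending(votes, best_min_high=12, worst_min_low=5):
    """Classify one attending physician from the residents' panel votes.

    votes is a list of strings, one per resident: "high", "medium", or "low".
    The default thresholds follow the 15-resident panel described in the text.
    """
    high = votes.count("high")
    low = votes.count("low")
    if high >= best_min_high and low == 0:
        return "best"
    if low >= worst_min_low and high == 0:
        return "worst"
    return "neither"


# Hypothetical ballots from a 15-resident panel (illustration only).
print(classify_attending(["high"] * 13 + ["medium"] * 2))   # best
print(classify_attending(["high"] * 13 + ["low"] * 2))      # neither
print(classify_attending(["low"] * 5 + ["medium"] * 10))    # worst
```

The asymmetric thresholds reflect the stated tendency of learners to rate even weak instructors as at least mediocre, so fewer "low" votes are required than "high" votes.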

For this study, students’ evaluations of attending physicians’ teaching
quality were also collected over the six years, as further evidence of
the validity of our consensus panel’s opinion (one would expect the
instructors who were highly rated by residents’ consensus to also have
high teaching ratings if the consensus process is valid). Our measure of
attending physicians’ teaching quality was from confidential,
end-of-month student evaluations, which were completed prior to the
students’ receiving their grades. The form consists of 16 items on a
five-point Likert-type scale (1 = strongly disagree, 5 = strongly
agree). Items included ratings of teaching skills and ability, rapport
with learners and patients, overall rating, and ratings of role
modeling. The coefficient alpha for the evaluation form is .96. This
means that there is a high degree of internal consistency among items
for rating teaching, and that the instrument is a reliable measure of
teaching. However, this also means that inter-item correlations are very
high (.75 to .95 for our form), which is not unusual for measures of
clinical teaching. Because of the high inter-item correlations, we used
the mean rating across all items as one measure of teacher “ability.”
The overall rating an instructor was assigned in our data set was the
mean of all the ratings from the students he or she worked with in the
six academic years.
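For readers unfamiliar with coefficient alpha, the following minimal sketch shows the standard Cronbach's alpha formula, alpha = k/(k - 1) * (1 - sum of item variances / variance of total scores). The function name and the toy ratings are assumptions for illustration; they are not the study's data.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Coefficient alpha for a rating form.

    responses: list of rows, one per completed evaluation, each row a
               list of k item ratings (hypothetical layout for this sketch).
    """
    k = len(responses[0])
    # Sample variance of each item across evaluations.
    item_variances = [variance([row[i] for row in responses]) for i in range(k)]
    # Sample variance of the summed score across evaluations.
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Made-up ratings on a three-item form: items that move together
# perfectly give the maximum alpha.
toy_alpha = cronbach_alpha([[1, 1, 1], [3, 3, 3], [5, 5, 5]])
```

When items are this strongly intercorrelated, averaging them into a single score, as the text describes, loses little information.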

Our analysis used multiple regression approaches from the general
linear model. Our dependent variables were scores on the National Board
of Medical Examiners (NBME) subject examination in medicine, taken at
the end of the clerkship, and USMLE Step 2 scores. Independent variables
included dummy-coded variables for different categories of attending
exposure (i.e., high-rated versus low-rated versus neither high- nor
low-rated attending physician exposure). We also included USMLE Step 1
scores in the model as a control variable for prior student academic
achievement.
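The model described above can be illustrated with a small ordinary least-squares sketch. It assumes “neither” as the reference category and mean-centers Step 1, so the intercept is the adjusted (least-square) mean of the reference group; both are choices made for this example, not details reported by the authors. The toy data are invented and noise-free so the fitted coefficients can be checked by hand.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: bring the largest entry of the column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_exposure_model(students):
    """students: list of (exposure, step1, outcome) tuples.

    Returns [b0, b_high, b_low, b_step1], with "neither" as the
    reference category and Step 1 centered on its sample mean.
    """
    mean_step1 = sum(s[1] for s in students) / len(students)
    rows = [[1.0, float(e == "high"), float(e == "low"), s1 - mean_step1]
            for e, s1, _ in students]
    y = [outcome for _, _, outcome in students]
    k = len(rows[0])
    # Normal equations: (X'X) b = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear(xtx, xty)


def adjusted_means(coef):
    """Predicted outcome for each exposure group at the mean Step 1 score."""
    b0, b_high, b_low, _ = coef
    return {"neither": b0, "high": b0 + b_high, "low": b0 + b_low}


# Invented, noise-free toy data: outcome = 100 + 5*high - 3*low + 0.5*step1.
pairs = [("high", 200), ("high", 220), ("low", 190), ("low", 210),
         ("neither", 205), ("neither", 215), ("neither", 195)]
toy = [(e, s1, 100 + 5 * (e == "high") - 3 * (e == "low") + 0.5 * s1)
       for e, s1 in pairs]
coef = fit_exposure_model(toy)
means = adjusted_means(coef)
```

In practice a statistics package would also report standard errors and p values for the dummy-coded terms; this sketch only recovers the point estimates.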



Results 

Data were collected from 502 third-year medical students (100% of
students) over the six academic years. We excluded 18 students who had
worked with both a high-rated and a low-rated instructor (as our model
was less clear about how this interaction might influence student
learning), for a final sample of 484. A total of 46 attending physicians
had more than five student evaluations over the six-year period and were
included in the list that the consensus panel rated.

Overall, ten faculty met the criteria to be rated “high.” Eight of the
ten were rated high by all residents, and the other two by 13 and 14
residents, respectively. Four of the ten were general internists. Eight
were men and two were women, which reflects our faculty demographics.
Five faculty met our consensus criteria for a “low” rating, including
one general internist and one woman.

Teaching evaluations were received from 96% of the students. The overall
mean teaching rating of the teachers rated high was 4.68 on the
five-point scale (SD = 0.22, range 4.23–4.94). For the “medium” group,
the mean teaching rating was 4.34 (SD = 0.32, range 3.4–4.92). For the
“worst”-rated attending physicians, the mean rating was 3.56 (SD = 0.48,
range 3.06–4.21). Mean differences between groups were highly
statistically significant (p < 0.001). Forty-five students had been
exposed to at least one low-rated and no high-rated attending physician;
219 had been exposed to at least one high-rated and no low-rated
attending physician; and 220 had been exposed to neither a high- nor a
low-rated attending physician. Our high-rated attending physicians were
more often attending physicians on the general medicine inpatient
services than were the low- or medium-rated faculty, hence the disparity
in numbers of students per faculty.
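The grouping of students by attending exposure, including the exclusion of students who worked with both a high-rated and a low-rated attending, can be sketched as follows. The function name and input format are assumptions for illustration; the counts are those reported in the text.

```python
def exposure_group(attending_categories):
    """Assign a student to an exposure group.

    attending_categories: consensus category ("high", "medium", "low")
    of each attending the student worked with (hypothetical format).
    """
    has_high = "high" in attending_categories
    has_low = "low" in attending_categories
    if has_high and has_low:
        return "excluded"   # mixed exposure was dropped from the analysis
    if has_high:
        return "high"
    if has_low:
        return "low"
    return "neither"


# Counts reported in the text.
reported = {"low": 45, "high": 219, "neither": 220}
analyzed = sum(reported.values())   # students entering the regression
enrolled = analyzed + 18            # adding back the excluded students
```

The reported group sizes are internally consistent: the three groups sum to the final sample of 484, and adding the 18 excluded students gives the 502 enrolled.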

Table 1 presents the least-square mean scores on the post-clerkship
NBME subject examination in medicine and on USMLE Step
2, depending on exposure to high-, low-, or medium-rated instruc- 
tors (least-square means are predicted means adjusted for USMLE 
Step 1 scores). As can be seen, students who worked with at least 
one of our consensus panel’s highly rated instructors scored signif- 
icantly higher on the post-clerkship NBME examination in medi- 
cine and USMLE Step 2. 

Conclusions 

Our findings once again confirm the association of better clinical
teaching with better student examination performance, demonstrating in a
quantitative fashion the outcomes of teaching. The effect sizes in this
current project are much more substantial than those in our prior
reports, amounting to one-fourth to one-third of a standard deviation,
or, for example, up to seven or eight points on USMLE Step 2, versus the
one-sixth to one-seventh standard deviation effect sizes of our prior
reports. We attribute our stronger conclusions to the more refined
method of this current project. First, our previous reports included all
instructors, regardless of the numbers of students they had taught, and
therefore all faculty were eligible to be included in the high- or
low-rated category even if they had few student evaluations. For
example, we may have included in our high category those faculty with
only two or three ratings that were all high, when over time their
ratings might have regressed towards a more stable mean that did not
qualify them as such. In essence, we were categorizing some instructors
as better or worse by using imprecise measures of their teaching
ability. This imprecision would tend to add background “noise” to the
analysis, attenuating our findings and effect sizes. Second, we
disentangled learner outcomes from ratings by learners with our
residents’ consensus panel. As shown, attending physicians who were
rated highly by the residents’ consensus panel had significantly higher
teaching ratings than did the medium- and low-rated instructors. Our
previous method, relying on categorization solely by teaching rating,
may have led to the exclusion of some otherwise excellent clinical
teachers simply because they did not quite meet the “top 20% of student
evaluations” we had required in our previous report to be considered a
highly rated instructor.
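The effect sizes quoted in this paragraph are expressed in standard deviation units; converting them to score points is a single multiplication. The sketch below assumes a Step 2 standard deviation of 23 points, a figure chosen only for illustration (it is in the range of the parenthetical values in the table, which are assumed here to be standard deviations).

```python
ASSUMED_STEP2_SD = 23.0  # assumption for illustration, not a reported parameter


def sd_units_to_points(effect_size_sd, scale_sd=ASSUMED_STEP2_SD):
    """Convert an effect size in SD units to points on the score scale."""
    return effect_size_sd * scale_sd


# Current project: one-fourth to one-third of a standard deviation.
current_low = sd_units_to_points(1 / 4)
current_high = sd_units_to_points(1 / 3)

# Prior reports: one-seventh to one-sixth of a standard deviation.
prior_low = sd_units_to_points(1 / 7)
prior_high = sd_units_to_points(1 / 6)
```

Under that assumption the current effect corresponds to roughly six to eight points, about twice the three to four points implied by the earlier effect sizes.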

Our findings seem to indicate that clinical teaching has an in- 
fluence on outcomes, such as performance on USMLE Step 2. One 
might wonder how a short four-week exposure in a single discipline 
could influence USMLE Step 2 scores to such a degree, given that 
USMLE Step 2 comprises a wide variety of disciplines. Our answer 
is suggested by our model. From our experience as learners, the 
influence of a single outstanding instructor on one’s approach to 
learning should not be underestimated. We suspect that the best 
teachers do not necessarily impart more factual information (facts
which may be obsolete in a few years), but rather they engender a
learning climate that makes learning fun, enjoyable, and exciting. 
They may do this by their example, by modeling the process of 
lifelong learning, by the joy they bring to their teaching, or by 
combinations of qualities such as these. Regardless, the learner’s 
approach to learning is in some fundamental way changed, carrying
over to the other clerkships and, we hope, to residency and beyond. 
Further studies should investigate the influence of outstanding 
teachers on life-long learning. 

Several limitations to our study should be kept in mind as one
interprets our results. This is a single-institution, single-discipline
study, and certainly national studies are needed to assert the
generalizability of our findings, as well as studies in other
disciplines. In addition, our study focused on but one outcome measure,
students’ performances on NBME-type examinations, which measure but one
aspect of clinical ability (knowledge). Future research should
investigate the influence of teaching on other student outcomes, such as
clinical skills, attitudes towards patients and the profession, and
doctor-patient communication and relationships. Finally, this project’s
method did not lend itself as well to measuring the influence of
residents’ teaching on students’ outcomes, so further studies are
needed.

Despite these limitations, we conclude that attending physicians’
teaching quality can have a measurable impact on students’ examination
performances. We therefore believe it is possible to begin considering
learners’ outcomes as an important measure of faculty’s teaching
ability, perhaps (with more study) an important addition to teaching
portfolios and promotion dossiers. But even more, we believe our
findings add to the growing literature on the critical importance of the
educational mission that indicates students’ learning would be
jeopardized if the educational mission were to be compromised for fiscal
reasons.

This research was supported by a grant from the National Board of Medical Examiners
Research Fund, 56-9798.

Correspondence: Charles H. Griffith III, MD, MSPH, Department of Internal Medi-
PLENARY: OUTSTANDING RESEARCH PAPERS



Moderator: Karen Mann, PhD 



When Residents Talk and Teachers Listen: A Communication Analysis 

JUDY L. PAUKERT 



Communication is not only an exchange of ideas but also a form of social
behavior that negotiates relationships. How two parties talk with each
other reveals their relative status, level of rapport, and value for
each other. Not surprisingly, the power that a speaker derives from his
or her status may jeopardize a conversation. The teacher’s role,
particularly as evaluator, often leads the teacher to dominate
conversation with the learner. In one-on-one teaching and other dyadic
interactions, the less powerful party expects to adapt to the dominant
party’s speech and initiations. When adaptation is extreme,
communication is authoritarian; the more powerful party rejects the less
powerful party’s speech by interrupting, taking over, or monopolizing
the conversation. When adaptation is minimal, communication is
autonomous; the more powerful party encourages the other to dominate or
lead the conversation by verbal and nonverbal behaviors.

Although several studies have demonstrated that residents perceive
autonomy as important to their learning, research about the effects of
interactions between residents and attending physicians on the
development of clinical independence has produced contradictory
findings. No study has described how autonomy emerges from communication
between residents and their teachers. Analysis of conversations between
teachers and learners has generally been limited to determining the
amount and duration of contact. Most content analysis has focused on the
topics discussed and categorization of utterances. However, analysis of
another dyadic interaction, physician-patient communication, has
identified several distinct patterns based on communication control and
verbal dominance.

An in-depth examination of communication patterns between 
the physician-teacher and the physician-in-training may increase 
our understanding of the types of interactions that help residents 
learn. This study analyzed how preceptors and residents interact 
during teaching encounters in ambulatory pediatrics primary care
settings. This study focused primarily on autonomous communica- 
tion when residents dominate the conversation. 

Method 

The Institutional Review Board approved this study, which was conducted
in the continuity care clinics of the general pediatrics residency
program at Baylor College of Medicine, Houston, Texas. The study
involved both academic and community (private practice) sites.
Preceptors were selected based on diversity of teaching reputation,
teaching and pediatrics experience, interpersonal skills, and practice
setting (solo to large group). The final sample was made up of six
academic and seven community preceptors. Four to nine clinical teaching
encounters were observed and audiotaped for each preceptor. Each
encounter was a unique opportunity to capture a communication pattern.
In all, 76 preceptor-resident interactions were analyzed using the
grounded-theory method. An experienced educator re-examined and
independently coded a portion of the transcribed encounters. Intercoder
agreement was about 95%. Participating preceptors were also asked to
confirm the analyses.
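The text reports intercoder agreement of about 95% without stating how it was computed. As a hedged illustration, the sketch below shows simple percent agreement alongside Cohen's kappa, a chance-corrected alternative; which statistic the author used is not known, and the function names and toy codes are assumptions for the example.

```python
from collections import Counter


def percent_agreement(codes_a, codes_b):
    """Percentage of encounters given the same code by both coders."""
    if len(codes_a) != len(codes_b):
        raise ValueError("both coders must code the same encounters")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100.0 * matches / len(codes_a)


def cohens_kappa(codes_a, codes_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(codes_a)
    observed = percent_agreement(codes_a, codes_b) / 100.0
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    # Chance agreement: product of each coder's marginal proportions.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Invented codes for four encounters (not study data).
coder_1 = ["mutual", "mutual", "autonomous", "default"]
coder_2 = ["mutual", "mutual", "autonomous", "authoritarian"]
agreement = percent_agreement(coder_1, coder_2)
kappa = cohens_kappa(coder_1, coder_2)
```

Percent agreement is easier to interpret but can look high when one category dominates, which is why a chance-corrected figure is often reported as well.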

Results

The encounters included acute care, follow-up, and well-child visits and
involved first-, second-, and third-year residents. Four distinct
patterns of communication were identified based on conversational input
and verbal dominance. Of 76 interactions, 54 (71%) showed a
conversational balance between speakers: 47 mutual (high preceptor and
high resident input) and seven default (low preceptor and low resident
input). The remaining 22 interactions showed imbalances between
speakers: 15 autonomous (high resident and low preceptor input) and
seven authoritarian (high preceptor and low resident input).

Almost 20% of the interactions were classified as autonomous. Of these,
12 (80%) occurred in community settings. Two academic and four community
preceptors engaged in autonomous interactions. No preceptor relied on
autonomous interactions exclusively, although one academic and one
community preceptor used only authoritarian communication. Thus, 11 of
the 13 preceptors used more than one communication pattern.
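The four communication patterns form a two-by-two grid of preceptor input against resident input. The sketch below encodes that grid and recomputes the reported percentages from the reported counts. The lookup-table structure and names are assumptions for illustration; how “high” and “low” input were judged is not specified here.

```python
# (preceptor input, resident input) -> communication pattern
PATTERNS = {
    ("high", "high"): "mutual",
    ("low", "low"): "default",
    ("low", "high"): "autonomous",
    ("high", "low"): "authoritarian",
}


def classify_interaction(preceptor_input, resident_input):
    """Map a pair of input levels ("high"/"low") to a pattern name."""
    return PATTERNS[(preceptor_input, resident_input)]


# Counts reported in the text.
counts = {"mutual": 47, "default": 7, "autonomous": 15, "authoritarian": 7}
total = sum(counts.values())

balanced_pct = 100 * (counts["mutual"] + counts["default"]) / total
autonomous_pct = 100 * counts["autonomous"] / total
community_share_pct = 100 * 12 / counts["autonomous"]
```

The counts sum to the 76 analyzed interactions, and the recomputed shares match the rounded figures in the text (71% balanced, just under 20% autonomous, 80% of autonomous interactions in community settings).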

Further analysis of autonomous interactions revealed specific pre- 
ceptor behaviors. In every autonomous interaction observed, the 
preceptor recognized the resident’s “expertise” and allowed the res- 
ident to dominate communication during the interaction or, at 
least, the conversation about the patient. Generally, the preceptor’s 
approval resulted from the preceptor’s identifying the resident’s
level of understanding as appropriate for a case. The examples re- 
ported in the following sections represent behaviors observed across 
the series of autonomous interactions. 

Probing Questions. Preceptors used probing questions to assess
residents’ understanding. In one community encounter, a first-year
resident presented an 8-year-old child complaining of nighttime coughing
and congestion related to physical exertion. After listening to a
concise but detailed exposition of subjective and objective findings,
the preceptor asked, “What do you think of his sequelae to his
respiratory infection, exercise cough, and that kind of thing?” The
resident’s response confirmed a level of understanding appropriate for
diagnosing the patient’s condition:

R: Well, it’s pretty likely that he has some kind of twitchy airway.
He’s had a recent infection and recent irritation to his lung and he
just had another little cold. So anything that he might get on top of
it might cause him to have a little bit tighter air flow. So maybe at
night, [and] that might be one aspect of reactive airway disease,
especially when he’s active.

In another encounter, an academic preceptor used variations of the
sample probe throughout a third-year resident’s presentation of a
5½-year-old girl with Angelman’s syndrome and a febrile seizure
disorder. The preceptor began probing after the resident had finished
presenting the subjective findings:

R: Also [she was] seen by Neurology for a history of febrile seizures
seen with infections. She’s been on Depakene. She has been on several
medicines. First Dilantin liquid, then Dilantin tablets that were
crushable, but she had a lot of drooling and would drool out most of
it. Depakene, first the sprinkles and now the elixir.

P: They really thought this was a seizure disorder and not just febrile
seizures?

R: Thought so. [Looking through chart for Neurology entry] Impression
is seizure disorder, febrile. Recommended an EEG
[electroencephalogram] which Mom said was done but was not a good study,
and hasn’t been seen in the [Neurology] Clinic in about two years.
Mom wants to take her off the Depakene and I told her that I would
want her to be seen by the physicians [neurologists] here.

In this encounter, the resident responded to the question by giving an
“expert” answer: citing the chart entry made by Neurology. Then, the
resident elaborated by expressing the need and rationale for obtaining a
blood test to determine whether (1) the current dosage of anticonvulsant
was therapeutic and (2) stopping the medication would do no harm. The
preceptor’s remarks to the resident’s concise assessment and plan for
this complicated patient with multiple medical problems signaled support
of the resident’s autonomy:

R: So, impression is history of Angelman’s syndrome, developmental
delay, history of complex febrile seizures, left exotropia, and
[patient] would probably benefit from visits back to her multiple
subspecialists.

P: So, we’re going to check her Depakote level today and then maybe
decide [about taking her off Depakote]. And you want her to go to
Neurology as well.

The preceptor signified concurrence with the resident’s plan by using
“we” to show agreement (“we’re going to check her Depakote level today
and then maybe decide”). Similarly, the preceptor confirmed the
resident’s autonomy and dominance by using the pronoun “you” (“you want
her to go to Neurology as well”).

Inference. In other encounters, the preceptor inferred the resident’s
level of knowledge and understanding from the organization,
thoroughness, and conciseness of the case presentation, a finding also
demonstrated by Irby. The case presentations of the only third-year
resident observed in the community setting (four interactions) were so
well organized and articulated that the preceptor rarely commented other
than to agree with the resident’s findings. The resident almost
monopolized the conversation, with a smooth, confident, and complete
case presentation in SOAP format (subjective-objective-assessment-plan).
The preceptor spoke only when the resident paused and showed agreement
by using minimal reinforcers, such as “all right,” “okay,” and “that
sounds right.” Conversationally, minimal reinforcers cue the speaker
that the listener is involved and following the speaker’s thoughts. In a
clinical interaction, these utterances also cued the resident that the
preceptor was willing to allow the resident to dominate the talk and the
encounter.

Besides smoothness, proper terminology, and adherence to SOAP format,
other characteristics of the presentation permitted a different
community preceptor to assess a first-year resident’s level of
knowledge. In this encounter, the resident was confident but more
relaxed in style and language than the previously described third-year
resident. Satisfied with the resident’s presentation of the subjective
and objective findings, the preceptor asked for the resident’s
assessment and plan. The resident replied, “Her right TM looks like
really white. I just, I guess that there’s pus behind it. ... It looks
way different than the left side. ... So right otitis. And since she’s
never had any problems before, just do amoxicillin.” Although not
eloquent, the resident’s response brought agreement from the preceptor
and closed the encounter.

Admittedly, without probing the extent of a resident’s knowledge, a
preceptor might wrongly infer that a resident’s understanding was
appropriate. By engaging in the behavior of “showing,” that is,
confidently presenting findings and knowledge of certain entities of a
case, a resident might hide actual deficiencies of other entities within
the same case. The preceptor’s own knowledge of a resident’s past
performance, particularly for a disease or family of diseases, may
prevent some mistakes. For example, in one encounter, an academic
preceptor asked a second-year resident to limit the case presentation
and “give the big points.” The resident condensed the subjective and
objective findings into two sentences: “These kids are here for
well-child checks. The bottom line is that the older




The resident’s response probably tapped into two important pieces of
information available to the preceptor. First, the preceptor knew what a
second-year resident should know about lice and ringworm, both common
pediatric problems. Second, the preceptor knew this resident
specifically from interactions over the preceding two years. Despite
this knowledge, the preceptor listened attentively throughout the
resident’s speech and offered minimal reinforcers like “oh no” and
“okay.”

Nonverbal Behaviors. In the 15 autonomous interactions, all preceptors
listened attentively and used verbal and nonverbal behaviors to indicate
that they followed the residents’ reasoning and talk. Nonverbal
behaviors, such as eye contact, facial expressions, head nods, and alert
body posture, more accurately disclose how well a party is listening to
a conversation than do verbal behaviors. Another important nonverbal
behavior observed was the control of the patient’s chart. In most
autonomous interactions, the resident controlled the patient’s chart.
Controlling the chart prevented the preceptor from reading the chart
during case presentation and diverting the conversation to an unrelated
chart entry, behaviors observed in the seven authoritarian interactions
characterized by preceptor dominance. In both autonomous and
authoritarian interactions, the dominant speaker controlled the
patient’s chart.

Absence of Teaching Scripts. Preceptors did not use a clinical teaching
script in any autonomous interaction. Fatigue or time of an encounter,
such as the last encounter of a late-running clinic, might have affected
a preceptor’s decision not to use a clinical teaching script, but
instead to permit the resident’s autonomy. Arguably, a preceptor who
failed to assess a resident’s understanding might have missed an
opportunity to use a teaching script.

At least one autonomous interaction exemplified how a clinical teaching
script might have been less effective than the teaching created by
follow-up of the case itself. This encounter was the follow-up visit of
a “fussy” 2-week-old child, who had been consoled only by feeding when
initially seen by the same academic preceptor and first-year resident.
The initial teaching encounter, which corresponded to the first visit,
contained a clinical teaching script regarding diagnosis and management
of suboptimal weight gain in a baby. At that time, the preceptor and the
resident negotiated a treatment plan to increase the number of feedings
and rule out gastroesophageal reflux as an organic cause for fussiness
and poor weight gain. Between the initial and follow-up visits, the
patient had been x-rayed and started on appropriate medication for
reflux by another preceptor and resident. By the follow-up visit (and
second teaching encounter), the resident was able to learn from the case
itself and to see the results of the patient’s treatment. The resident’s
speech reflected understanding of the case and delight with the
patient’s improvement:

R: She did not cry once, the whole time I was in there.

P: You’re kidding. Was this the same baby we saw one week ago
[laughing]?

R: No. I was thinking, this was a completely different baby. You
know, when I examined her, I did everything. You know? She’s alert,
looking around. I mean she wasn’t lethargic. She was fine, screaming
at the top of her lungs.

P: Wonderful!

R: Mom says that she’s not fussy at home the way she had been. This is
the way she is like at home as well. [She] is breastfeeding about every
2 to 2½ hours, 15 minutes on each breast. Seems satisfied after each
feed. The feedings are going better since she started the Zantac and
Cisapride.

Observing the post-treatment changes in the patient reinforced the
preceptor’s earlier teaching script. This pair of encounters also
demonstrated the benefits of the continuity care experience in which a
resident develops and maintains a continuing physician-patient
relationship with a panel of patients. The preceptor’s experience with a
resident and that resident’s panel of patients may increase the
likelihood that the preceptor will allow the resident greater autonomy
in the teaching interaction.

Conclusions 

Autonomous preceptor-resident communication is characterized by
high resident and low preceptor input and preceptors’ behaviors 
that confirm and recognize the residents’ speech. The preceptor 
assesses the resident’s level of understanding in a particular case 
through questioning or inference from the organization, thorough- 
ness, conciseness, and confidence of the resident’s speech. Twelve 
of the 15 autonomous interactions occurred in community settings. 
The reasons for this difference in the rate of autonomous interac- 
tions between academic and community settings were not identified.
Conversation is a response not only to a person but also to
environmental conditions, such as the available time and space for 
teaching. Economic factors may encourage community preceptors 
to permit autonomy rather than provide more directed teaching. 

Academic and community preceptors were alike in other respects,
particularly the use of multiple communication patterns. This finding
suggests a spectrum of the relationships between teacher and learner
that encourage different conversational behaviors. The use of multiple
communication patterns may indicate “scaffolding,” an overarching
process within the preceptor-resident relationship in the continuity
care experience.

Scaffolding refers to techniques that support learners in their 
efforts to solve difficult problems or perform difficult tasks. For a 
novice resident, a preceptor may behave authoritatively to provide 
maximum support by modeling desired behaviors, such as how to 
perform an examination or give anticipatory guidance. As the res- 
ident’s experience and skill in relating to the preceptor and patients 
increase, the need for support decreases. Thus, the preceptor behaves
less authoritatively, and more collaborative and autonomous
interactions occur. 

Scaffolding requires the preceptor to know what support a resi- 
dent truly requires. The space and time available for teaching may 
increase the preceptor’s reliance on autonomous interactions, even 
when there is a recognizable teaching moment. In this study, two 
extremes were found: some first-year residents were involved in au- 
tonomous interactions and some third-year residents in authoritar- 
ian interactions. Possibly, preceptors select the amount of support 
to give a resident based on the resident’s specific experience with
a problem. Whitman and Schwenk seem to advocate “selective”
scaffolding by suggesting that medical teachers alternate between
assuming active and passive roles, depending on learners’ needs.
Training preceptors in scaffolding techniques is not likely to elim- 
inate default communication patterns, particularly when fatigue un- 
dermines conversation. 

This study is limited because it was performed in a pediatrics setting.
Despite its setting, the study has potential implications for all
clinical teaching that involves one-on-one interactions between learner
and teacher. It is probable that the communication patterns identified
may be observed in other practice areas because this study did not limit
participation to exemplary teachers and sampled for diversity. The
effect of observation on participants’ behaviors cannot be entirely
discounted. However, over a clinic session, both preceptors and
residents seemed to forget that they were being observed.

Future studies should delineate how different communication patterns
affect teaching and learning. The relationship of the teaching
interaction to future physician-patient communication also deserves
investigation. For example, do the ways that preceptors talk with
residents predict the ways that residents talk with patients?

Teaching preceptors active listening and supportive verbal and 
nonverbal behaviors may benefit both patients and residents. When 
preceptors carefully attend to their residents’ presentations, by lis- 
tening and acknowledging the residents’ concerns and opinions, 
they model how physicians should attend to their patients’ con- 
versations, by listening and acknowledging patients’ concerns. 
Studying the conversation of clinical teaching should illuminate
how physicians-in-training learn the conversation of healing.

Correspondence: Dr. Paukert, Mail Code 7737, Department of Surgery, The University
of Texas Health Science Center, 7703 Floyd Curl Drive, San Antonio, TX 78229-3900.
Reprints are not available.
The Relationship between the Nature of Practice and Performance on a Cognitive Examination

JOHN J. NORCINI and REBECCA S. LIPNER



Certification by a specialty board affiliated with the American Board of
Medical Specialties is one of the most widely used markers of physician
competence in the United States. To enhance the meaning of this
credential and in recognition of the need for periodic reassessments of
physicians, the specialty boards have time-limited their certificates.
To maintain certification, most of the boards have developed programs
that incorporate (1) a check of credentials, (2) self-evaluation and/or
continuing medical education, and (3) a secure written or computer-based
examination. The secure examination is considered an integral component
of certification because it provides assurance that physicians are
keeping up with changes in medical knowledge and that they possess the
ability to successfully manage patients’ problems that are important but
rarely encountered in practice. Moreover, from patients’ and payers’
perspectives, a secure examination lends more credibility to the
certification process.

Despite these benefits, the secure examination is contentious be- 
cause many believe that it tests only the ability to recall factual 
knowledge and, as such, bears little relationship to the day-to-day 
practice of medicine. However, the growth of electronic records
and databases has made it possible to begin to address this concern 
by comparing certification status and test performance with aspects 
of practice such as volume, process of care, and patients’ outcomes. 

There is considerable evidence that physicians who treat large 
numbers of patients with a particular condition generally provide 
better care for such parienrs. Volume is directly associated with 
patients’ outcomes, regardless of the discipline or procedure.’ 
Therefore, the relationship of practice volume to examination 
scores is an important part of any test-validation effort. A study 
involving the first geriatric medicine certifying examination indi- 
cated that the number of geriatric patients seen in practice was 
positively correlated with examination scores.’’ Likewise, a study- 
involving cardiologists indicated that performance on cardiovas- 
cular graphics questions was positively correlated with experience.^ 
Specifically, scores on the interpretation of echocardiograms were 
correlated with the numbers of echocardiograms interpreted in 
practice or training. Similarly, scores on the interpretation of ar- 
teriograms and ventriculograms were correlated with the numbers 
of angioplasties performed. More recently, a study of a cognitive 
recertification examination in critical care medicine showed that 
scores were related to the amounts of time physicians spent in the
direct care of critically ill patients. This relationship persisted even
after statistically removing performance on the initial certifying ex- 
amination in the same discipline.^ 

There is also evidence that certification status and examination 
performance are related to the process of care. A study of physicians
in Quebec compared consultation rates, inappropriate prescribing 
for the elderly, and mammography screening rates with licensing 
examination scores based partly on cognitive tests.’ Physicians with
higher scores referred more patients for consultation, prescribed
fewer inappropriate drugs and more disease-specific medications for
symptom relief, and appropriately referred more women for mammography.
Similarly, a study of certified and non-certified internists
found differences in preventive care services favoring the certified
physicians.^

Although it is difficult to measure good practice outcomes well, 
some progress is being made in using available data in the validation
of cognitive examinations. A recent study investigated
whether there were differences among certified and self-designated
cardiologists, internists, and family practitioners in the mortality of
their patients with acute myocardial infarction.*^ Data for all 
myocardial infarctions for calendar year 1993 in Pennsylvania were 
analyzed. Certification was associated with a 15% reduction in 
mortality irrespective of specialty, and after taking account of severity
of illness, hospital characteristics, patient volume, and years
since graduation. Similarly, Ramsey and colleagues found differ- 
ences in some outcomes that favored the certified physicians,'*' and 
there are a few studies with similar positive results in other spe- 
cialties. 

Given the significance of the topic and the need for additional 
investigation, the purpose of this study was to extend previous work 
by exploring in more detail the relationship between examination 
performance and the nature of practice. Specifically, candidates for 
recertification in critical care medicine supplied information, via a
practice survey, about the amounts of time that they spent in the
care of patients with cardiovascular and pulmonary problems (i.e.,
practice volumes). Moreover, they rated the complexity of the
problems they saw. These practice data were compared with performances
on the items from the examination that dealt specifically
with cardiovascular and pulmonary problems to determine whether
patient complexity, in addition to patient volume, was associated
with test scores. 

Method 

Participants. The data are based on the candidates who attempted
the 1997 and 1999 recertification examinations in critical
care medicine and responded without error to the practice survey.
All of these candidates had time-limited initial critical care certificates.
Ninety-nine percent and 93% of their certificates expired
in 1997 and 1999, respectively. In 1997, the average examinee 
had been certified in internal medicine in 1979 (18 years, SD = 
4 years), and these physicians spent most of their time in direct 
patient care (mean = 70%, SD = 26%). In 1999, the average examinee
had been certified in internal medicine in
1982 (17 years, SD = 4 years), and these physicians also spent the
majority of their time in direct patient care (mean = 72%, SD = 
26%). 

Examinations. The 1997 and 1999 critical care medicine recertifying
examinations each consisted of 120 single-best-answer questions,
all of which were asked in the context of a clinical problem
and most of which required synthesis and judgment to reach the
correct response. Consistent with the purpose of the tests, the content
focused on well-established principles of patient care that
should be known without consulting medical resources.

The questions were written by a test committee of experts, but
before these questions were selected for the examinations, they
were sent to critical care practitioners who rated them for relevance
to practice. The examinations had average relevance ratings of
more than 4 on a five-point rating scale, where 5 denoted “very
relevant.” The same items appeared on the 1997 and 1999 critical
care medicine initial certifying examinations.

This study concentrated on the 1997 and 1999 exams’ cardiovascular
and pulmonary disease questions because problems in these
areas were frequently encountered in practice by candidates, and
they were the largest subsets of items on the examination. In addition,
these examination years had substantial numbers of candidates
taking the examination. Table 1 presents the numbers of
items and their means (SD) for the 1997 and 1999 critical care
medicine recertification examinations. For this study, subtest scores
were reported on the raw score scale.

Since scores on initial certifying examinations are related to fea- 
tures of residency and fellowship training as well as fund of medical 
knowledge, the initial certifying examination in critical care medi- 
cine was used as a statistical control in this study.’ ^ The analyses 
took account of the scores on this examination and attributed ef- 
fects to other variables if they made independent contributions to 
the explanation of the performance. The scores on the initial certifying
examination had been standardized against a national group
with a mean of 500 (SD = 100) and were equated over years using
a common-item linear equating technique. Table 1 presents the
means and standard deviations for each cohort. 

Survey. When physicians applied for the examination, they were 
asked to supply information about their practices. Specifically, for 
each of the specialties of medicine, they were asked what percent- 
age of time they spent with patients. Physicians whose responses 
did not amount to 100% were removed from the analysis. Candidates
were also asked to rate on a five-point scale (where 1 was
“not very complex” and 5 was “very complex”) the complexity of
the cardiovascular and pulmonary disease cases they managed. Finally,
to test the joint effect of time and complexity, the ratings
were multiplied by percentage of time spent in an area. Table 1 
presents descriptive data for these variables. 
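
The survey handling described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the record layout, the field names (`pct_time`, `complexity`), and the function name are hypothetical. The only rules taken from the text are that the reported percentages must total 100% and that the joint variable is the complexity rating multiplied by the percentage of time.

```python
# Hypothetical sketch of the survey screening and the interaction variable.
# Field names and data layout are assumptions made for illustration only.

def build_predictors(response, area):
    """Return (pct_time, complexity, interaction) for one content area,
    or None when the reported percentages do not total 100%."""
    total = sum(response["pct_time"].values())
    if abs(total - 100.0) > 1e-9:
        return None  # respondent would be removed from the analysis
    pct = response["pct_time"][area]
    rating = response["complexity"][area]  # 1 = not very complex, 5 = very complex
    return pct, rating, pct * rating  # joint effect: time x complexity


# Made-up respondent, for illustration only.
example = {
    "pct_time": {"cardiovascular": 10, "pulmonary": 35, "other": 55},
    "complexity": {"cardiovascular": 3, "pulmonary": 4},
}
print(build_predictors(example, "pulmonary"))
```

With this scaling the interaction term can range from 0 to 500, which is the range assumed when the regression weights are interpreted for a full-time practice.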

Procedure. The data were submitted to four separate stepwise linear
regressions, two (1997 and 1999) for cardiovascular disease and
two (1997 and 1999) for pulmonary disease. The dependent measures
were the cardiovascular and pulmonary subscores on the critical
care medicine recertifying examination and the independent
variables were (1) score on the initial critical care medicine certifying
examination, (2) the frequency of patients’ problems encountered
in the area, (3) the complexity of those problems, and (4)
the interaction of the two factors (i.e., frequency times complexity).

Table 1. Descriptive Data and Regression Results for Physicians Who
Took the 1997 and 1999 Critical Care Medicine Recertifying
Examinations and Completed a Survey on Their Practices' Characteristics

                                              1997             1999
                                              (n = 510)        (n = 334)

Certifying exam scores, mean (SD)             572 (59)         568 (58)

Cardiovascular disease questions              (n = 38)         (n = 24)
  Number right, mean (SD)                     30.8 (3.3)       19.5 (2.3)
  Complexity rating, mean (SD)                3.1 (1.3)        3.3 (1.3)
  % of time, mean (SD)                        10% (12)         11% (13)
  Interaction of frequency and complexity,
    mean (SD)                                 37.7 (52.5)      43.8 (59.5)
  β coefficients
    Constant                                  16.50            10.66
    Initial certifying exam subscore          .025             .015
    % of time                                 NS               NS
    Complexity of problems                    NS               NS
    Time-complexity interaction               .007             .008

Pulmonary disease questions                   (n = 14)         (n = 26)
  Number right, mean (SD)                     11.5 (1.6)       20.0 (3.2)
  Complexity rating, mean (SD)                4.5 (1.0)        4.3 (1.1)
  % of time, mean (SD)                        35% (24)         33% (24)
  Interaction of frequency and complexity,
    mean (SD)                                 158.4 (113.4)    150.7 (112.3)
  β coefficients
    Constant                                  5.92             1.83
    Initial certifying exam subscore          .009             .026
    % of time                                 .007             .017
    Complexity of problems                    NS               .71
    Time-complexity interaction               NS               NS

Results 

In predicting the cardiovascular disorder subscore on the 1997 recertification
examination, the critical care medicine certifying examination
entered first into the regression equation (R² change =
.18, t = 10.44, p < .001), followed by the interaction of the frequency
and complexity of patients with cardiovascular disorders (R²
change = .01, t = 2.72, p = .02). The other variables did not contribute
significantly. For the 1999 recertification examination, similar
results were obtained. The critical care medicine certifying examination
again entered first (R² change = .14, t = 7.74, p < .001),
followed once more by the interaction of the frequency and complexity
of patients with cardiovascular disorders (R² change = .04,
t = 3.98, p < .001). Again, the other variables did not contribute
significantly.
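
As a rough consistency check on these figures, the t statistic for a predictor entered at a given step can be related to its R² change. When a single predictor is added, F = ΔR² / ((1 − R²) / (n − k − 1)) and t = √F, where R² is the cumulative value after the step and k is the number of predictors then in the model. The sketch below applies this standard identity to the rounded values in the text. It assumes that each regression used every candidate counted in Table 1 (n = 510 and n = 334), which the paper does not state explicitly, and the function name is illustrative. Because the published R² changes are rounded to two decimals, the results can only approximate the reported t values.

```python
import math

def t_from_r2_change(r2_change, r2_after, n, k):
    """Approximate |t| for the single predictor added at a step.

    r2_change: increase in R^2 at this step
    r2_after:  cumulative R^2 once the predictor is in the model
    n:         sample size (assumed equal to the n in Table 1)
    k:         number of predictors in the model after the step
    """
    f_stat = r2_change / ((1.0 - r2_after) / (n - k - 1))
    return math.sqrt(f_stat)

# 1997 cardiovascular subscore (rounded inputs from the text)
print(round(t_from_r2_change(0.18, 0.18, 510, 1), 2))  # text reports t = 10.44
print(round(t_from_r2_change(0.01, 0.19, 510, 2), 2))  # text reports t = 2.72
# 1999 cardiovascular subscore
print(round(t_from_r2_change(0.14, 0.14, 334, 1), 2))  # text reports t = 7.74
print(round(t_from_r2_change(0.04, 0.18, 334, 2), 2))  # text reports t = 3.98
```

The recomputed values land near the reported ones, as would be expected when the inputs are rounded.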

From these results we can infer that if there were critical care
physicians who spent all of their time (100%) treating patients with
complex cardiovascular problems, they could be expected to perform
3.5 (1997) to 4 (1999) points better on cardiovascular disease items
than would those who did not see any cardiovascular problems.
There were not many physicians in this sample who spent all of
their time treating such patients, but this constitutes a difference
of 1.1 to 1.7 standard deviations.

In predicting the pulmonary disorder subscore on the 1997 recertification
examination, the critical care medicine certifying examination
entered first (R² change = .12, t = 8.20, p < .001), followed
by frequency of patients with pulmonary disorders (R² change
= .02, t = 2.57, p < .001). The other variables did not contribute
significantly. For the 1999 recertification examination, the critical
care medicine certifying examination also entered first (R² change
= .23, t = 10.20, p < .001), followed by the complexity of patient
problems (R² change = .08, t = 5.16, p < .001), and frequency of
patients with pulmonary disorders (R² change = .01, t = 2.58, p =
.01). Again, the other variables did not contribute significantly.

From these results we can infer that if there were critical care 
physicians who spent all of their time treating patients with complex
pulmonary problems, they could be expected to perform .7
(1997) to 5.2 (1999) points better on pulmonary disease items than
would those who did not see any pulmonary problems. There were
not many physicians in this sample who spent all of their time 
treating such patients, but this constitutes a difference of between 
.4 to 1.6 standard deviations. 
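
The inferences in this section follow from the regression weights in Table 1. The worked sketch below reproduces them under two assumptions that the text implies but does not state: a full-time practice is scored as 100% of time, and "complex" is taken as the top rating of 5, so the time-complexity interaction equals 500. Non-significant (NS) weights are treated as zero, and the names used are illustrative.

```python
# Regression weights and subtest SDs as printed in Table 1.
FULL_TIME = 100      # percent of time (assumed maximum)
MAX_COMPLEXITY = 5   # top of the five-point scale (assumed meaning of "complex")

def predicted_gain(b_time=0.0, b_complexity=0.0, b_interaction=0.0):
    """Score difference versus a physician who sees no such patients."""
    return (b_time * FULL_TIME
            + b_complexity * MAX_COMPLEXITY
            + b_interaction * FULL_TIME * MAX_COMPLEXITY)

# Each entry: (predicted gain in raw points, SD of the subtest score)
cases = {
    "cardiovascular 1997": (predicted_gain(b_interaction=0.007), 3.3),
    "cardiovascular 1999": (predicted_gain(b_interaction=0.008), 2.3),
    "pulmonary 1997": (predicted_gain(b_time=0.007), 1.6),
    "pulmonary 1999": (predicted_gain(b_time=0.017, b_complexity=0.71), 3.2),
}
for label, (gain, sd) in cases.items():
    print(f"{label}: {gain:.2f} points, {gain / sd:.2f} SD")
```

Under these assumptions the gains come to roughly 3.5, 4, .7, and 5.2 points, in line with the figures quoted in the text.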

Discussion 

The purpose of this study was to extend previous work by exploring 
the relationship between test performance and the nature of practice.
Physicians who were recertifying in critical care medicine in
1997 and 1999 supplied information about the amounts of time
they had spent in the care of patients with cardiovascular and pulmonary
problems and the complexity of the problems they saw.
These practice data were compared with performances on the rel- 
evant items from the examination. 

For cardiovascular diseases, the interaction between volume and 
complexity had a significant relationship with test scores for both 
years of the study, even after controlling for previous examination 
performance. For pulmonary diseases, only volume was a significant 
predictor in 1997, but both volume and complexity were significant
in 1999. The magnitude of the effects was noteworthy, ranging from 
.4 to 1.7 standard deviations. 

These results should be interpreted with care because this study 
has several limitations. First, the number of questions in each content
area was relatively small, and this attenuated the correlations
that were reported. Second, the estimates of time and complexity
were based on self-reported data, and a number of physicians made
errors in filling out the form. Patients’ records would clearly be a
more accurate and less biased source of these data. Third, critical
care medicine is a relatively new discipline and the vast majority
of diplomates are certified in pulmonary disease. Therefore, these
results may not generalize to a more homogeneous, less cross-disciplinary
field.

Despite these limitations, the results of this study replicate previous
work indicating that cognitive examination test scores are
associated with patient volume, and by implication from other studies,
with patient outcomes. The study also found that there is a
relationship between scores and the complexity of problems physicians
see in practice. This finding bears more investigation, but
it seems sensible that a practice that includes the challenge of treat- 
ing many complex patients should lead to more knowledge and 
better judgment on the part of the physician. 

These findings, taken together with previous work, suggest that 
performance on a cognitive examination is related to performance 
in practice. Of course, this type of examination is not a substitute 
for rigorous evaluation of practice outcomes, nor is it broad enough
to include important aspects of competence such as communication
skills and professionalism. Nevertheless, until better measures are
available for high-stakes use, the cognitive examination is a rea- 
sonable alternative.’’* When such measures become available, there 
will still be a place for cognitive assessment of new developments 
in medicine and for patients’ problems that are important but in- 
frequently encountered in practice. 

The American Board of Internal Medicine supported this research but it does not
necessarily reflect its views. Correspondence: John J. Norcini, PhD, Institute for Clinical
Evaluation, 510 Walnut Street, Suite 1700, Philadelphia, PA 19106-3699.
• TRUTH AND CONSEQUENCES

Validity of Faculty Ratings of Students’ Clinical Competence in Core Clerkships in Relation to
Scores on Licensing Examinations and Supervisors’ Ratings in Residency

CLARA A. CALLAHAN, JAMES B. ERDMANN, MOHAMMADREZA HOJAT, J. JON VELOSKI,
SUSAN RATTNER, THOMAS J. NASCA, and JOSEPH S. GONNELLA



Connections between assessment measures in medical school, residency,
and practice need to be studied in order to ascertain the
validity of such assessments in the continuum of medical education 
and physician training.’ ‘ Assuring the validity of students’ clinical 
competence ratings is especially important because these assess- 
ments are among the major components of the dean’s letter of 
evaluation and, as such, are used in the ranking of candidates for 
residency programs. 

Medical schools expend considerable time and effort in preparing
a dean’s letter for each of their graduating students. It is based
largely on the faculty’s assessment of the student’s academic and
clinical performance. It should be one of the most important attachments
to students’ applications for graduate medical education.
Despite this, residency directors may not attach much importance
to the dean’s letter,’ in part, perhaps, because they are uncertain
that the information contained within it is valid for predicting
performance during residency.

Previous surveys have indicated that academic criteria such as
U.S. Medical Licensing Examinations (USMLE) scores, membership
in Alpha Omega Alpha (AOA), the medical honor society,
and class rank"’ *’ were rated highly as selection variables by residency
directors. More recently, performance during clinical clerkships
has been cited as an important factor,’"'’ particularly in the
specialty for which the student is applying, and especially for the
most competitive residencies.‘ It is thus increasingly important to
confirm the validity of clerkship evaluations to assure the credibility
of the dean’s letter as a predictor of postgraduate performance.

The dean’s letters of evaluation from Jefferson Medical College 
include a broad range of information (USMLE Step 1 score, second-
and third-year class ranks, histogram of third-year written examination
grades, clinical ratings, and excerpts from the narrative
evaluations from the third-year clerkships). We have previously
documented the validity of a calculated medical school class rank
in predicting postgraduate performance.’'^’ 

The purpose of this study was to examine the validity of faculty 
ratings of students' clinical competences in six core clinical clerkships
in relation to the students’ subsequent performances on medical
licensing examinations and to program directors’ ratings of
clinical performance in the first year of residency.

Method 

Study participants were 2,158 students at Jefferson Medical College
who graduated between 1989 and 1998. Faculty ratings of students'
clinical competences in core clerkships in the third year of medical
school, scores on licensing examinations, and residency program
directors’ ratings of clinical competence were retrieved from the
database of the Jefferson Longitudinal Study of Medical Education.

The predictors (independent variables) included faculty ratings 
of students’ clinical competences in six core clerkships (family medicine,
internal medicine, obstetrics-gynecology, pediatrics, psychiatry,
and surgery). These global ratings are part of a detailed assessment
form that is completed by the clerkship coordinators at
each site. The global ratings of clinical competence in each clerkship
were assigned on a five-point scale currently designated as 5
= “high honors,” 4 = “excellent,” 3 = “good,” 2 = “marginal,” and
1 = “incomplete” or “failure.”

The criterion measures (dependent variables) included scores on 
USMLE Steps 2 and 3 and postgraduate clinical competence ratings 
for graduates who had given written permission for follow-up (about
75% of the graduating seniors). These ratings were assigned by directors
of the residency programs near the end of the first year,
using a 33-item rating form. This form measures three areas of clinical
competence: “data gathering and processing skills” (16 items),
“interpersonal skills and attitudes” (ten items), and “socioeconomic
aspects of patient care” (seven items). Each item was rated on a
four-point Likert scale, and ratings were averaged within the three
competence areas. Data have been reported in support of the mea- 
surement properties of this rating form, including construct validity 
(factor structure), the internal consistency aspect of reliability, and 
the criterion-related validity of the form."’’" 

Scores on the USMLE Step 1 were also used to adjust the outcomes
for performance differences on this examination. Bivariate
correlations and multiple regression analyses were used to examine
the associations between ratings in medical school clerkships and
the criteria.

Results 

The bivariate correlations reported in Table 1 are all statistically 
significant (p < .01). The highest correlations of .29 and .20 for
clerkship ratings and USMLE scores were found between the internal
medicine clerkship and Steps 2 and 3, respectively. The lowest
correlations of .17 and .11 were observed for the psychiatry
clerkship and Step 2 scores and for the surgery clerkship and Step
3 scores, respectively. Larger correlations were obtained for the internal
medicine, family medicine, pediatrics, and obstetrics-gynecology
clerkships than for the psychiatry and surgery clerkships.

The results of multiple regression analysis indicated that the
shared variance between clerkship ratings and Step 2 scores was
14% (R² = .14). The overlap was 7% for Step 3 scores, 12% for
postgraduate ratings in data gathering and processing skills, 11%
for ratings in interpersonal skills and attitudes, and 9% for ratings
in the socioeconomic aspects of patient care. Each of these relationships
was statistically significant (p < .01).
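
The shared-variance percentages are the squares of the multiple correlations in Table 1. A short check, using the multiple R values as printed there (the dictionary keys are labels chosen for this sketch):

```python
# Multiple R for each criterion measure, as printed in Table 1.
multiple_r = {
    "USMLE Step 2": 0.38,
    "USMLE Step 3": 0.27,
    "Data gathering and processing": 0.35,
    "Interpersonal skills and attitudes": 0.33,
    "Socioeconomic aspects": 0.30,
}

# Shared variance is R squared, expressed here as a whole percentage.
shared_variance_pct = {name: round(100 * r ** 2) for name, r in multiple_r.items()}

for name, pct in shared_variance_pct.items():
    print(f"{name}: {pct}%")
```

Squaring the printed values gives 14%, 7%, 12%, 11%, and 9%, matching the overlaps reported in the paragraph above.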

Inspection of the standardized regression coefficients (beta
weights) reported in Table 1 indicates that in a multivariate statistical
model, competence ratings given in family medicine, internal
medicine, and pediatrics clerkships contributed significantly and
consistently to the prediction of all five criterion measures (p <
.01). The magnitudes of the standardized regression coefficients indicate
that among these clerkships, ratings in the internal medicine
clerkship had the largest unique contribution in predicting three of
the five criterion measures.

Ratings in the psychiatry clerkship contributed to the prediction 
of Steps 2 and 3 in the multivariate model (p < .05), but did not
predict ratings of postgraduate clinical competence. Ratings in the
surgery clerkship had a unique contribution to prediction of Step
2, and to ratings for data-gathering and processing skills and interpersonal
skills and attitudes.

Additional analyses examined the total number of high-honors
Table 1. Summary Results of Correlational Analyses of Third-year Students' Clinical Competence Ratings in Six Core Clerkships and the Students'
Scores on USMLE Steps 2 and 3 and their Postgraduate Clinical Competence Ratings*

                                  USMLE                       Postgraduate Clinical Competence
                          Step 2        Step 3        Data Gathering†   Interpersonal‡    Socioeconomic§
Clerkship                 (r)    β      (r)    β      (r)    β          (r)    β          (r)    β

Family medicine           (.21)  .11¶   (.18)  .08¶   (.23)  .13¶       (.18)  .09¶       (.21)  .13¶
Internal medicine         (.29)  .19¶   (.20)  .12¶   (.27)  .15¶       (.22)  .11¶       (.22)  .10¶
Obstetrics-gynecology     (.20)  .08¶   (.11)  .00    (.20)  .11¶       (.22)  .13¶       (.18)  .10¶
Pediatrics                (.26)  .11¶   (.19)  .12¶   (.23)  .10¶       (.23)  .12¶       (.20)  .08¶
Psychiatry                (.17)  .06**  (.14)  .07**  (.10)  .01        (.09)  .01        (.10)  .02
Surgery                   (.22)  .08¶   (.11)  .01    (.18)  .07**      (.17)  .08¶       (.15)  .04
Multiple R                       .38¶          .27¶          .35¶              .33¶              .30¶

* The total sample included 2,158 graduates of Jefferson Medical College between 1989 and 1998. Bivariate correlations are shown in parentheses. Standardized regression
coefficients (beta weights) are shown outside parentheses. All bivariate correlations are statistically significant (p < .01).
† Competence ratings of postgraduate clinical skills in “data-gathering and processing.”
‡ Competence ratings of postgraduate clinical skills in “interpersonal skills and attitudes.”
§ Competence ratings of postgraduate clinical skills in “socioeconomic aspects of patient care.”
¶ p < .01.
** p < .05.



ratings earned by each student across the six clerkships. We clas- 
sified the numbers of high-honors ratings, which ranged from 0 to 
6, into the following three categories: 0 (48% of the sample), 1-3 
(48% of the sample), and 4-6 (4% of the sample). 

We examined the willingness of the residency program directors 
to offer further residency training to each resident at the end of 
the first postgraduate year in relationship to the number of high 
honors. Further residency, which is usually offered only to those 
who solidly meet the first-year training standards, was offered to all 
but 66 (5%) of the 1,401 graduates for whom data were available. 
We found that the proportion of graduates who would not be offered
further training was the highest (6%) among those with no
high-honors rating in any clerkship, followed by those with one to
three high-honors ratings (3%). All of the graduates with between
four and six high-honors ratings were offered further training. The
association between the number of high-honors ratings and the
offer of further residency training was statistically significant (χ²
= 9.4, p < .01).

We conducted additional analyses by adding Step 1 scores to the
multiple regression models in predicting the five criterion measures
reported in Table 1 to statistically adjust for differences in Step 1
scores. After adjustment, the competence ratings in internal medicine,
family medicine, and pediatrics significantly predicted Step
2 scores; and competence ratings in family medicine and pediatrics
significantly predicted Step 3 scores. The statistical control of Step
1 scores did not change the pattern of findings in multivariate regression
analysis in which the ratings of competence in the core
clerkships were the predictor and ratings of the three postgraduate
clinical competence areas of “data-gathering and processing skills,”
“interpersonal skills and attitudes,” and “socioeconomic aspects of
patient care” were the criterion measures.

Discussion 

The present study examined the validity of clinical competence
evaluations assigned by medical school faculty, which are often reported
in dean’s letters of evaluation. Our findings suggest that faculty
ratings are valid and are useful in predicting performances on
medical licensing examinations and clinical competence ratings in
residency. Although the faculty ratings assigned in the internal
medicine, family medicine, and pediatrics clerkships yielded
stronger associations with the criterion measures than did those in
the psychiatry and surgery clerkships, the number of high-honors
ratings that a student earned in all six clerkships was found
to have a significant association with whether or not further train- 
ing was offered to the graduate at the end of the first year of resi- 
dency. 

It should be noted that although the correlation coefficients 
were all statistically significant, they were not large. All fell in 
the range of small to moderate effect size estimates described by
Cohen.*’ However modest in magnitude, the consistency of the 
results provides credible evidence in support of the validity of the 
ratings. 

Conclusions and Implications 

Medical schools want to help each of their graduates to obtain the 
best residency position commensurate with his or her qualifications. 
However, most faculty realize that it is shortsighted to prepare a 
dean’s letter that misrepresents a student’s medical school record or 
excludes relevant observations of the student’s performance. Obfuscation
is counterproductive.*^ We found that the clerkship ratings
for internal medicine, family medicine, pediatrics, and obstetrics-gynecology
were significantly correlated with criterion
measures. These evaluations were significant predictors of performance
in postgraduate training. Likewise, our findings indicate that
the high-honors ratings of competence in core clerkships were significantly
associated with residency program directors’ decisions to
offer further residency training.

The largest correlations were obtained for ratings in the internal 
medicine clerkship. This could be due to the fact that our students
spend 12 weeks on this, and only six weeks on the others. This
expanded time in the internal medicine clerkship allows for more
observations and broader evaluations by a larger number of faculty
and residents that could contribute to an increased overlap between
this clerkship’s ratings and the criterion measures. 

The Association of American Medical Colleges recommended in 
1989 that the dean's letter be described as a letter of evaluation 
rather than as a letter of recommendation.'^ Many have followed 
this recommendation. Studies in a variety of settings have confirmed
that superior performance in medical school does predict
performance beyond medical school.' *^' *^’ Our results should
not only increase the confidence of the medical school faculty with
respect to their evaluations, but also reassure residency selection
committees about the validity of evaluations in dean’s letters as
predictors of clinical competence beyond medical school. Every
medical school should be committed to provide empirical support
for the validity of information in its dean’s letters of evaluation.

Correspondence: Clara Callahan, MD. Office, Jefferson Medical College,
Philadelphia, PA 19107o853; e-mail: (clara.callahan@mail.tju.edu).
• TRUTH AND CONSEQUENCES

Moderator: Gwendie Camp, PhD



Do Students’ Attitudes during Preclinical Years Predict Their Humanism as Clerkship Students? 

JOHN C. ROGERS and LOUISA COUTTS 



There is an increased awareness of the importance of humanism in 
the medical school curriculum/ Most of the early work concerned 
teaching methods*’^ and measurement of humanism.' ■*'* Contem- 
porary work reinforces the critical role of empathy in humanism*'* 
but also broadens the concept to include other values, qualities,
and behaviors: authenticity, compassion, fidelity, integrity, respect, 
spirituality, and virtue/'"’’ To distinguish it from humanism, pro- 
fessionalism is characterized as accountability, altruism, commit- 
ment to excellence, duty and commitment to service, and honor 
and respect for others/*' The project we report here began over 
eight years ago, so compassion, empathy, respect, and considerate 
biopsychosocial interactions are the conceptual cornerstones of the
operational definition we used in this work; we therefore defined
the humanistic physician as one who 

1. respects the patient’s viewpoints and considers his or her 
opinions when determining health care decisions, 

2. attends to the psychological well-being of the patient, 

3. regards the patient as a unique individual, 

4. treats the patient in the context of his or her family and social
and physical environment, 

5. possesses good communication and listening skills,

6. engenders trust and confidence, 

7. demonstrates warmth and compassion, and 

8. is empathetic."

Despite the considerable attention to humanism in medical education,
little is known about predictors of humanism in students.
Knowledge about predictors of humanism could foster the evaluation
and design of curricular innovations by identifying attitudes
that may be affected by educational interventions. The objective
of our work was to identify potentially modifiable attitudes that are
associated with students’ humanistic performance during clinical
clerkships. These predictors may be important outcome measures
for the curricular interventions, and may help assess the need for
additional innovations.

Methods 

Between 1992 and 1995, we had students complete attitude ques- 
tionnaires during a second-year required preclinical course and 
again during a required third-year clerkship in family and com- 
munity medicine. The third-year students also completed a clinical 
performance examination (CPX) where standardized patients (SPs) 
rated the students' humanism. 

We administered four previously developed attitude questionnaires:
(1) Physician Belief Scale,'* (2) Physician Reactions to Uncertainty,*'*
(3) Risk in Clinical Practice,*' and (4) Decision Making
Style.*'' We had students complete these instruments to provide
students with feedback on their attitudes that might affect their
clinical behaviors. We gave each student a confidential report with
his or her score for each scale, the class’s average score and range
for each scale, and the normative averages and ranges from the
instrument-development samples. The third-year students’ report
included their second-year scores, so each student could reflect on
any changes in attitude scores after experience in clinical rotations.
We considered the concepts measured by the instruments to be
important to general clinical performance, and not specifically or
solely to be predictors of humanism. Similarly, we chose an available
measure of humanism that could be completed by SPs to give
students feedback on this particular aspect of clinical performance. 

The Physician Belief Scale is a 32-item questionnaire about ways 
physicians who adopt a biopsychosocial approach to patient care 
differ from physicians who do not. Responses are recorded on a 
five-point Likert-type scale ranging from 1 = disagree to 5 = agree. 
The scale scores range from 32 (maximum degree of psychosocial 
orientation) to 160 (minimum psychosocial orientation). The authors 
of the instrument determined the internal consistency of the scale 
by Kuder-Richardson formula 20, which measures the extent to which 
the items reflect a single underlying construct. This unidimensional 
scale is highly internally consistent, with r = 0.88. The mean score 
of the 180 family physicians, psychiatrists, and internists in the 
development sample was 74.3 (SD = 13.7), and that of the 99 family 
physicians, psychiatrists, internists, and pediatricians in the 
validation sample was 72.1 (SD = 13.0). Eight faculty family 
physicians in the Department of Family and Community Medicine had a 
mean score of 58.4 (SD = 10.1). 

The Physician Reactions to Uncertainty scale measures physicians’ 
affective reactions to uncertainty, which “seem to be a significant, 
yet overlooked, dimension of patient care decisions and variations in 
practice patterns.” This scale consists of two subscales derived from 
factor analysis: (1) Stress from Uncertainty subscale (13 items) and 
(2) Reluctance to Disclose Uncertainty to Others subscale (nine 
items). Both subscales use a six-point Likert response scale from 
1 = strongly agree to 6 = strongly disagree, with many items 
reverse-scored to prevent response-set bias. The Stress from 
Uncertainty subscale ranges from 13 to 78 (the higher the score the 
greater the stress) and has a Cronbach alpha of .90, indicating 
excellent internal consistency. The mean score for 428 family 
physicians, general practitioners, general internists, medical 
subspecialists, and surgeons in the development sample was 44 
(SD = 11). The Reluctance to Disclose Uncertainty to Others subscale 
ranges from 9 to 39 (the higher the score the greater the reluctance 
to disclose uncertainty) and has a Cronbach alpha of .75, indicating 
acceptable internal consistency. The mean score for the 428 
physicians was 23 (SD = 6). 
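The summed scoring described above can be sketched as follows. This is a minimal illustration, not the instrument authors' scoring key: the function name and the choice of which items are reverse-scored are assumptions made only for the example, since the text does not say which items are reversed.

```python
def score_likert(responses, reverse_items=(), low=1, high=6):
    """Sum Likert responses, flipping any reverse-scored items.

    reverse_items holds zero-based item positions; the positions used
    in any call are hypothetical, because the published text does not
    list which items are reverse-scored.
    """
    total = 0
    for index, value in enumerate(responses):
        if not low <= value <= high:
            raise ValueError(f"response {value} is outside {low}-{high}")
        # A reverse-scored item maps low -> high, high -> low.
        total += (low + high - value) if index in reverse_items else value
    return total


# Thirteen items on a six-point scale span 13 to 78, the range
# reported for the Stress from Uncertainty subscale.
lowest = score_likert([1] * 13)
highest = score_likert([6] * 13)
```

With 32 items and `high=5`, the same function gives the 32 to 160 range reported for the Physician Belief Scale.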

The Risk in Clinical Practice questionnaire measures physicians’ 
general self-perceptions of levels of risk aversion/risk seeking and 
risk attitudes in financial, physical well-being, social, and ethical 
domains. One nine-point Likert scale item is used for each domain 
(1 = avoid risk/danger to 9 = seek risk/danger). In a study of 
laboratory usage, 12 family physicians completed the questionnaire, 
producing a mean score for each item: gambling for money (3.6, 
SD = 1.8), physical danger (4.6, SD = 1.4), hurting people’s feelings 
(reverse-scored, 5.3, SD = 1.9), professional norms (reverse-scored, 
6.2, SD = 1.7), risks to self (4.7, SD = 1.4), and risks to patients 
(3.8, SD = 1.6). Significant positive rank-order correlations were 
observed between: gamble and norms, risk self and gamble, risk self 
and danger, and risk self and norms. In the present study, a 
student’s aggregate risk score was the proportion of the maximum 
possible score. A measure of internal consistency for this scale is 
not available. 

The Decision-making Style is a 32-item questionnaire that measures 
the extent to which a person is intuitive or analytic in making 
decisions. Each item presents a forced choice between two 
alternatives. Scores range from 0 to 32, with low scores indicating 
analytic approaches and high scores indicating intuitive approaches 
to decisions and problems. A measure of internal consistency for 
this scale is not available. 

Table 1. Medical Students’ Attitudes about Psychosocial Aspects of Care, Uncertainty, Risk, and Decision-making Style: Comparison of Students’ 
Attitude Scores during Their Preclinical and Clinical Years with Attitude Scores of Physicians in Instrument-development Samples and Correlation 
of Students’ Attitude Scores with Humanism Ratings by Standardized Patients during a Third-year Clerkship Clinical Performance Examination, 
Baylor College of Medicine, 1992-1995* 

                                                            Attitude-scale Scores                              Correlation of Attitudes with Humanism
Attitude Instrument                                         Physicians in    Students’        Students’        Students’          Students’
                                                            Instrument-      Preclinical      Clinical         Preclinical        Clinical
                                                            development      Scores           Scores           Scores             Scores
                                                            Samples          (n = 299)        (n = 366)        (n = 299)          (n = 366)

Physician Belief Scale (range 32-160); higher score means   73.2             72.2             73.5             -.262 (p < .001)   -.222 (p < .001)
  lower belief in psychosocial aspects of care

Physician Reactions to Uncertainty (range 13-78); higher    44               48               46               -.049 (p = .230)   -.010 (p = .424)
  score means more stress from uncertainty

Physician Reactions to Uncertainty (range 9-39); higher     23               29               28               -.042 (p = .265)   -.115 (p = .014)
  score means greater reluctance to disclose
  uncertainty to others

Risk in Clinical Practice (range 1-9); higher score means   .52              .49              .50              .090 (p = .087)    -.016 (p = .378)
  more risk seeking

Decision-making Style (range 0-32); higher score means                       17.2             17.1             .174 (p = .004)    .021 (p = .347)
  more intuitive than analytic approach to decisions

* We administered standardized attitude instruments to medical students during their second preclinical year and again during their third-year clerkship in family and community 
medicine. We also administered a clinical performance examination during the clerkship, where standardized patients rated the students’ humanism using a validated scale. We 
correlated the standardized patients’ humanism ratings with the students’ preclinical and clinical attitude scores. 

Standardized patients (SPs) rated students’ performances during a 
CPX on interview style, history taking, physical examination, 
management negotiation, and patient education. The SPs completed an 
encounter checklist and a separate humanism questionnaire for each 
student in each encounter. Interstation exercises assessed the 
students’ differential diagnoses, management plans, and 
identification of ethical principles. The CPXs were conducted during 
the second and fifth weeks of the six-week rotation, and each CPX had 
a minimum of five cases. Students completed between ten and 13 
patient cases in the CPX, with ten minutes for the SP activity and 
five minutes for the interstation activity. The CPX interstation 
reliability alpha coefficients ranged between 0.43 and 0.56. These 
moderate scores are comparable to the reliabilities reported in the 
literature for CPXs that used similar numbers of cases and tested 
similarly wide varieties of clinical skills, such as those 
encountered in family medicine. A student had to complete at least 
one full CPX with a minimum of five cases to be included in this 
study, in order to have multiple SP ratings of humanism and hopefully 
a stable measure for each student. 

For humanism, the SPs rated each student at each CPX station using 
the eight-item, abbreviated Humanism Scale developed by Hauck et al. 
The full Humanism Scale is a 24-item questionnaire assessing whether 
a physician has a “sensitive, non-humiliating, and empathetic way of 
helping [a patient] deal with some problem or need” and correlates 
highly with patients’ satisfaction with physician-related aspects of 
care. The full scale includes the eight components listed in the 
first paragraph of this article, with three items each. The 
eight-item scale has the following items (one per component): 

1. This doctor seems to take a personal interest in me. 

2. Even when my problem is small, this doctor is concerned. 

3. I have confidence in this doctor’s decisions. 

4. This doctor respects my beliefs. 

5. I would talk to this doctor if something were troubling me. 

6. This doctor takes an interest in my home life. 

7. This doctor is easy to talk to. 

8. This doctor seems to know what I am going through when I 
tell him/her about a problem. 

In the development study, responses were recorded as an “x” on 
a line between strongly disagree and strongly agree. The response 
point was measured with a ruler and converted to a percentage of 
the total line (1 to 99 for each item). The scale score was the mean 
for the 24 items. The development sample of 185 patients produced 
humanism scores ranging from 16 to 99 (1 to 99 is the possible 
range), with a mean of 75. Cronbach alpha (reliability coefficient) 
for the 24 items was .95; the eight-item scale with one item per 
component had a coefficient of .93. We used a seven-point Likert 
response scale (1 = strongly disagree, 7 = strongly agree) for the 
eight-item scale, which had a Cronbach alpha of .96. To adjust for 
variability among SPs on these ratings, we normalized each SP’s 
humanism scores to the average for all SPs. 

We used SPSS for Windows® for data analysis to produce Pearson 
correlation coefficients. 
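As an illustration of the two analytic steps named above, the following sketch adjusts ratings for rater differences and then computes a Pearson correlation. It is not the authors' SPSS procedure. In particular, the text does not specify how each SP's scores were normalized; shifting each SP's ratings so that their mean equals the grand mean is one plausible reading, and it is an assumption here, as are all function names.

```python
from math import sqrt
from statistics import mean


def center_on_grand_mean(ratings_by_sp):
    """Shift each SP's ratings so every SP has the same mean.

    Assumed reading of "normalized each SP's humanism scores to the
    average for all SPs"; the original method may have differed.
    """
    grand = mean(r for ratings in ratings_by_sp.values() for r in ratings)
    return {
        sp: [r - mean(ratings) + grand for r in ratings]
        for sp, ratings in ratings_by_sp.items()
    }


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 values")
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)
```

Under this reading, a lenient rater and a strict rater who rank students identically yield the same adjusted scores, so rater leniency does not inflate or mask the correlation with attitude scores.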

Results 

The students’ scores for the attitude scales were quite similar to 
those of the development samples of experienced clinicians for the 
Physician Belief Scale and the Risk in Clinical Practice Scale. The 
students appeared to have more stress from uncertainty and reluctance 
to disclose uncertainty to others than the experienced clinicians in 
the development sample (Table 1). The standardized patients’ ratings 
of the students’ humanism also were quite similar to those of the 
development sample; when both scale scores are converted to a 
proportion of the maximum possible score (mean/maximum): development 
sample 75/99 = .76 and students 42/56 = .75. 
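The conversion to a proportion of the maximum possible score is simple arithmetic, sketched here with a hypothetical helper name. The student maximum of 56 follows from the eight items rated on the seven-point response scale described in the Methods.

```python
def proportion_of_maximum(mean_score, maximum_score):
    """Express a mean scale score as a fraction of the scale maximum."""
    if maximum_score <= 0:
        raise ValueError("maximum_score must be positive")
    return mean_score / maximum_score


# Development sample: mean 75 on a 1-to-99 line scale.
development = proportion_of_maximum(75, 99)

# Students: mean 42 on eight items rated 1 to 7 (maximum 8 * 7 = 56).
students = proportion_of_maximum(42, 8 * 7)
```

Rounded to two decimals, these reproduce the .76 and .75 given in the text.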

The students’ preclinical Physician Belief Scale scores were 
significantly inversely correlated with clerkship humanism (Table 1). 
Students who rated the psychosocial aspects of medicine lower 
(higher Physician Belief Scale score) had lower humanism scores. 
This relationship persisted for students’ psychosocial beliefs during 
the clerkship. Students’ preclinical decision-making style was 
directly related to humanism, with more intuitive students having 
higher scores on the humanism scales, but this relationship was not 
stable into the clinical year. The students’ preclinical Reluctance 
to Disclose Uncertainty to Others scores were not significantly 
related to humanism, but their clinical ratings were inversely 
related, indicating higher levels of humanism in students less 
reluctant to disclose their uncertainty. The students’ Stress from 
Uncertainty and their Risk in Clinical Practice scores were not 
related to humanism. 

Discussion 

These students’ attitudes about biopsychosocial aspects of care and 
risk in clinical practice appear similar to those of experienced cli- 
nicians, but students may be more stressed by uncertainty and re- 
luctant to disclose it to others than experienced clinicians. The 
standardized patients’ ratings of the students’ humanism were sim- 
ilar to the ratings the patients gave practicing physicians. These 
results lend some support to the legitimacy of using these instru- 
ments with students. 

The students’ preclinical attitudes toward the biopsychosocial as- 
pects of medical care are a potential predictor for their humanism 
on clinical rotations. The Physician Belief Scale was developed as 
a self-report instrument for practicing physicians, but may be useful 
as a predictive tool for students. Attitudes about uncertainty, risk, 
and decision making do not appear to be consistently related to 
humanism. 

These results are sensible considering the eight items that compose 
the Humanism Scale. The consistent relationship with the Physician 
Belief Scale indicates that students who do not value highly the 
biopsychosocial aspects of care may not be able to display the 
interest, concern, and respect necessary for high ratings of 
humanism by standardized patients. The inconsistent relationship for 
the Reluctance to Disclose Uncertainty to Others and Decision-making 
Style suggests that students comfortable with sharing their 
uncertainty or those preferring an intuitive decision style may be 
perceived by standardized patients as open about decisions, easy to 
talk to, or knowing what the standardized patients are going 
through. The concepts of Stress from Uncertainty and Risk in 
Clinical Practice aren’t as obviously related to the components of 
humanism and were not associated with humanism in this study. The 
concepts of uncertainty, risk, and decision making may still be 
important for clinical performance in general but not for humanism 
in particular. On the other hand, the concepts inherent in 
biopsychosocial attitudes seem to be related to clinical performance 
of the concepts of humanism we measured in this study: compassion, 
respect, empathy, and especially considerate biopsychosocial 
interactions. 

The limitations of this work include questions about both the 
internal and external validity. It is unclear how the number of cases 
completed by students could have affected the stability of the 
humanism measure or influenced the results. The potentially 
problematic reliability of the Risk in Clinical Practice and 
Decision-making Style scales could have contributed to the failure to 
detect relationships between those scales and humanism. Even the 
significant correlations we did observe are small, and account for 
little of the variability in humanism, so their statistical 
significance simply may be due to the sample size. The 
generalizability of these results is limited, since our study 
involved only one institution with approximately two class cohorts of 
students. Studies in other medical schools with additional cohorts of 
students would determine whether these apparent associations are 
stable. 
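The remark that the significant correlations account for little of the variability in humanism can be made concrete by squaring each coefficient, since r squared is the proportion of variance shared by two variables. The sketch below applies this to the significant coefficients in Table 1; the function and dictionary names are illustrative only.

```python
def shared_variance(r):
    """Proportion of variance shared by two variables with correlation r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return r * r


# Significant correlations with humanism reported in Table 1.
significant = {
    "Physician Belief Scale, preclinical": -0.262,
    "Physician Belief Scale, clinical": -0.222,
    "Decision-making Style, preclinical": 0.174,
    "Reluctance to Disclose Uncertainty, clinical": -0.115,
}

for label, r in significant.items():
    print(f"{label}: r = {r:+.3f}, r squared = {shared_variance(r):.3f}")
```

Even the largest coefficient, -.262, corresponds to roughly 7% of the variance in humanism ratings.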

Further instrument development may improve the ability of attitude 
measures to predict later humanistic behaviors. The Physician Belief 
Scale seems to measure well attitudes associated with considerate 
biopsychosocial interactions. Since contemporary definitions of 
humanism include so many concepts (compassion, respect, empathy, 
integrity, fidelity, authenticity, spirituality, and virtue), we may 
need one comprehensive instrument, or several 
separate instruments, to measure the attitudes corresponding to the 
many facets of humanism. Demographic variables that may predict 
humanism, such as gender, could be included in predictive models, 
but the fundamental purpose of the work we report here is the 
identification of potentially modifiable predictors. Including dem- 
ographic variables may improve the explanatory power of a mul- 
tivariate model, but will it lead to an admission policy of prefer- 
entially selecting students with the unchangeable demographic 
variables positively associated with humanism? Or should we con- 
centrate our efforts on curricula and the attitudes that we may be 
able to influence? 

Curricula in many medical schools have courses emphasizing the 
physician-patient relationship and problem-based learning, which may 
influence the students’ preclinical attitudes found in our work to be 
associated with humanism. Future work may reveal relationships 
between other attitudes, which are conceptually related to humanism, 
and students’ humanistic behaviors in clinical rotations. 
Measurements of both attitudes and behaviors may be important 
outcome measures for curricular interventions, and may help assess 
the need for additional innovations. 

Correspondence: Dr. John C. Rogers, Baylor Family Medicine, 5510 Greenbriar, Houston, 
TX 77005; e-mail: <jrogers@bcm.tmc.edu>. 
• TRUTH AND CONSEQUENCES 



Moderator: Gwendie Camp, PhD 



Early Identification of Students at Risk for Poor Academic Performance in Clinical Clerkships 

SCOTT A. FIELDS, CYNTHIA MORRIS, WILLIAM L. TOFFLER, and EDWARD J. KEENAN 



Many medical schools have revised, or are in the process of 
revising, their curricula. The impetus for this curricular change has 
been dependent on many factors. These factors include grant 
initiatives emphasizing the development of curricula to promote 
generalism and the Association of American Medical Colleges’ Medical 
School Objectives Project (MSOP), as well as significant shifts in 
the health care system, such as the growing influence of managed 
care. The more innovative curricular revisions to date have included 
multidisciplinary, integrated courses with longitudinal curricula and 
early clinical experiences throughout the first two years (the 
preclinical curriculum). 

Oregon Health Sciences University (OHSU) School of Medicine 
implemented its curriculum revision in 1992. The result of this 
effort was the restructuring of the first two years of the curriculum 
from 24 specific discipline-based courses to ten interdisciplinary 
units. One of the units, the Principles of Clinical Medicine (PCM), 
is a longitudinal two-year course composed of small-group activities 
half a day each week and a weekly half-day clinical preceptorship. In 
addition, there are nine integrated basic science courses in the 
first two years, and a one-week course, Transition to Clerkship, 
occurs at the end of the second year. The core clerkships, 
constituting the entire third year, include medicine, surgery, 
obstetrics-gynecology, family medicine, psychiatry, pediatrics, and 
rural primary care. Each of these clerkships is six weeks in duration 
with the exception of medicine, which occurs in two six-week blocks. 

The premise for this study was that early identification of medical 
students who are at academic risk provides a basis for intervention 
with individualized remedial programs. Previously, studies have 
investigated predictors of performance for years one and two of 
medical school. Little has been done to address early identification 
of students at risk for academic difficulty in the third year of 
medical school. The hypothesis was that performance in PCM during the 
predominantly pre-clinical curriculum of the first two years predicts 
students at risk for academic difficulty in the clinical clerkships. 
Accordingly, this study analyzed the relationship between parameters 
of student assessment, including a number of admission, curriculum, 
and standardized testing criteria, and an accepted standard of graded 
performance in the third-year core clerkships. 

Method 

The sample studied was a cohort of students beginning with those 
who matriculated at OHSU from 1992 to 1995 and who graduated 
between 1996 and 1999. Student data were available from OHSU 
databases; no major change in curriculum, grading policy, or 
calculation of student grade-point averages occurred during these 
years. In the study, all individual student performance data were 
treated as confidential. 

The primary outcome of this analysis was performance in the core 
clinical clerkships of the third-year curriculum, which serves as a 
critical component of the residency application process. All courses 
at OHSU, including clerkships, are graded as honors, near honors, 
satisfactory, marginal, or fail. Grade-point average (GPA) in year 
three was used as the outcome, with a GPA of 3.0 representing honors; 
2.0, near honors; 1.0, satisfactory; and 0, marginal/failure. After 
initial analysis as a continuous variable, we identified the lowest 
quintile of performance in year three (GPA < 2.0). 

A number of potential indicators were considered to predict 
performance in year three. These indicators included cumulative 
college GPA, separate MCAT scores (Verbal Reasoning, Biological 
Science, Physical Science, and Writing Sample), year one and year two 
basic science course performance as a mean percentage examination 
score, performance in the PCM course, and USMLE Step 1 score. The 
total MCAT score combined the Verbal Reasoning, Biological Science, 
and Physical Science scores. For the Writing Sample, the alphabetic 
score was coded from 4 to 15, with M = 8. Total points for the PCM 
course were used. In PCM, there are 80 points available for each of 
six quarters: 20 points for the clinical preceptorship, 10 points for 
small-group discussion activities, 10 points for patient examination 
activities, 10 points for an essay, 10 points for written exam, and 
20 points for a group objective structured clinical examination 
(GOSCE). 
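The PCM point structure can be tallied as below. The component weights come from the paragraph above; the dictionary keys and constant names are illustrative labels, not identifiers from the course itself.

```python
# Points available in each PCM quarter, as listed in the text.
PCM_POINTS_PER_QUARTER = {
    "clinical preceptorship": 20,
    "small-group discussion": 10,
    "patient examination": 10,
    "essay": 10,
    "written exam": 10,
    "GOSCE": 20,
}
QUARTERS = 6

points_per_quarter = sum(PCM_POINTS_PER_QUARTER.values())
total_points = points_per_quarter * QUARTERS

# Share of each quarter's points that comes from the written exam.
written_exam_share = PCM_POINTS_PER_QUARTER["written exam"] / points_per_quarter
```

The components sum to the stated 80 points per quarter, or 480 points over six quarters, and the written examination contributes one eighth of them.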

The first series of analyses were univariate, with all continuous 
predictor variables correlated with the primary outcome, year three 
GPA. Subsequently, a parsimonious logistic regression model was fit 
to predict this outcome using forward selection procedures. The odds 
of low performance (year three GPA < 2.0) were estimated. Cutoff 
points for categorizing each continuous predictor variable were based 
on the lowest quintile of each score or percentage, with latitude for 
ties. The significance of each predictor variable was assessed using 
a likelihood-ratio test statistic obtained from a logistic regression 
model fit to the outcome status. 

Results 

In total, data for 306 students were available. All data were 
complete except for one student who had attended a college without 
grades, seven students who had taken the earlier version of the MCAT, 
and two students whose USMLE scores were unavailable. 

Correlation coefficients were obtained for each performance 
indicator as compared with the year-three GPA. Of all variables, this 
outcome was most significantly related to the score in the PCM course 
(r = .61, p < .001); year two percentage performance (r = .54, 
p < .001); year one percentage performance (r = .52, p < .001); and 
USMLE Step 1 score (r = .47, p < .001). Year-three GPA was only 
modestly related to undergraduate GPA (r = .19, p < .05) and MCAT 
Writing Sample score (r = .16, p < .05), and was not related 
significantly to the total MCAT score, or the Biological Science, 
Physical Science, and Verbal Reasoning subscores. Figure 1 shows the 
relationship between the year-three GPA and the PCM score. 

Figure 1. Scatterplot of the relationship between Principles of Clinical Medicine 
(PCM) scores and year three grade-point averages at Oregon Health Sciences 
University School of Medicine, 1992-1999. 

Prior to logistic regression analysis, the relationship of each 
variable to performance in the lowest quintile of year-three GPA was 
analyzed in order to determine the accuracy of prediction. Each was 
dichotomized by the lowest quintile and compared in a 2 X 2 table 
with low year-three GPA. A score in the lowest quintile of PCM 
(< 380) correctly predicted low year-three performances of 38 of 68 
students (positive predictive value = 56%). Of 238 students who had 
scores above the lowest quintile, 212 (negative predictive value = 
89%) had year-three GPAs above the lowest quintile. These values were 
similar considering performance in the lowest quintile in year two 
(positive predictive value = 53%, predicting 36 of 68; negative 
predictive value = 89%, 211 of 238). A USMLE Step 1 score in the 
lowest quintile (< 190) correctly predicted 28 of 68 students who 
scored in the lowest quintile of year-three GPA (positive predictive 
value = 41%), whereas a score above the lowest quintile predicted 206 
of 238 (negative predictive value = 87%). No other variable performed 
similarly in univariate analysis. 
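The predictive values quoted above follow directly from the counts in each 2 X 2 table. A minimal sketch, with a hypothetical function name, that recomputes them from the reported counts:

```python
def predictive_values(true_positives, flagged, true_negatives, not_flagged):
    """Return (positive, negative) predictive values as fractions.

    flagged: students scoring in the lowest quintile on the predictor.
    true_positives: flagged students who also had a low year-three GPA.
    not_flagged: students scoring above the lowest quintile.
    true_negatives: not-flagged students whose year-three GPA was not low.
    """
    return true_positives / flagged, true_negatives / not_flagged


# Counts reported in the text for each predictor.
reported = {
    "PCM score": (38, 68, 212, 238),
    "Year two percentage": (36, 68, 211, 238),
    "USMLE Step 1": (28, 68, 206, 238),
}

for name, counts in reported.items():
    ppv, npv = predictive_values(*counts)
    print(f"{name}: PPV = {ppv:.0%}, NPV = {npv:.0%}")
```

In each case 68 flagged plus 238 not-flagged students account for the 306 students in the sample.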

A multivariate logistic regression model significantly predicted 
low year-three GPA (p < .001). (See Table 1.) Overall, performance in 
the lowest quintile in PCM was associated with a 9.45 times increased 
risk of performance in the lowest quintile of year-three GPA (95% CI, 
4.71-18.98). Similarly, performance in the lowest quintile of year 
two conferred a 6.39 times risk of low year-three GPA (95% CI, 
2.96-13.80). This model also included performance in the lowest 
quintile of the USMLE Step 1, although this was not significant 
(relative risk 1.83; 95% CI, 0.84-3.99). Last, performance by 
quintile of PCM score, after adjustment for USMLE Step 1 score and 
year-two percentage score, was linearly related to year-three GPAs 
less than 2.0 (p < .001). This confirms that the PCM score has a 
strong, graded relationship to performance in the clinical 
clerkships. A student receiving a PCM score in the second lowest 
quintile was 75% less likely (relative risk 0.25) to perform poorly 
in the clerkships, as compared with those with scores in the lowest 
quintile. A PCM score in the highest quintile was associated with a 
markedly reduced chance of poor performance; in fact, only one 
student in the highest quintile had a year-three GPA below 2.0. 
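The risk estimates in Table 1 can be approximately recovered from the tabulated coefficients and standard errors. In a logistic regression the exponentiated coefficient is, strictly speaking, an odds ratio; the article labels it relative risk. The sketch below assumes a conventional Wald interval with z = 1.96, which the text does not state, and its results differ slightly from the published values because the tabulated coefficients are rounded to two decimals.

```python
from math import exp


def odds_ratio_with_ci(coefficient, standard_error, z=1.96):
    """Exponentiate a logistic coefficient and its Wald interval.

    z = 1.96 gives an approximate 95% interval; using a Wald interval
    is an assumption, not something the article specifies.
    """
    return (
        exp(coefficient),
        exp(coefficient - z * standard_error),
        exp(coefficient + z * standard_error),
    )


# (coefficient, standard error) pairs from Table 1.
table_rows = {
    "Lowest quintile of PCM score": (2.24, 0.36),
    "Lowest quintile of year two % performance": (1.86, 0.39),
    "Lowest quintile of USMLE Step 1 score": (0.60, 0.40),
}

for label, (b, se) in table_rows.items():
    estimate, lower, upper = odds_ratio_with_ci(b, se)
    print(f"{label}: {estimate:.2f} ({lower:.2f}, {upper:.2f})")
```

The USMLE Step 1 interval spans 1, which is consistent with the statement that this predictor was not significant in the multivariate model.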

Discussion 

This study may be considered unique to the OHSU curriculum, yet 
many schools are developing similarly integrated curricula with 
early clinical experiences. This method of identifying students early 
in medical school who are at risk for academic and professional 
difficulties may be generalizable. 

In designing this study, we defined the outcome of interest as 
year-three GPA. As mentioned, this was chosen due to the established 
connection between clerkship evaluations and residency applications. 
However, it is important to remember that the validity of third-year 
clerkship evaluations as an indicator for performance as a physician 
is unclear. 

Our conclusion is that performance in the PCM course is predictive 
of performance in the core clerkships of the third-year curriculum. 
Additionally, by identifying students who perform in the lowest 
quintile of the PCM course, it is possible to identify the students 
who will be in the lowest quintile of the core third-year clerkships. 
One explanation of this relationship is that evaluation of 
performance in the PCM course better assesses the ability to use core 
knowledge, as well as the evaluation of patient care skills and 
professional attributes. Consequently, the assessment of student 
performances in the PCM course may coincide more closely with the 
approach to evaluation used in the core third-year clerkships. That 
is, greater emphasis is placed on preceptor and interactive session 
evaluations, with only a relatively small component related to 
performance on didactic written examinations. Thus, performance in 
PCM reflects behavioral and attitudinal factors associated with 
patient care, in addition to knowledge and skills, in contrast to a 
singular focus on cognitive performance as is typically the case in 
the basic science courses. 

Table 1. Logistic Regression of Performances in the Lowest Quintiles in 
the Seven Third-year Core Clinical Clerkships at Oregon Health Sciences 
University School of Medicine, 1992-1999 

                                                Coefficient   Standard      Relative      95% Confidence
                                                              Error         Risk          Interval

Lowest quintile of PCM* score (≤ 380)           2.24          0.36          9.45          4.71, 18.98
Lowest quintile of year two % performance       1.86          0.39          6.39          2.96, 13.80
  (< 82%)
Lowest quintile of USMLE Step 1 score (≤ 190)   0.60          0.40          1.83          0.84, 3.99

* PCM = Principles of Clinical Medicine course. 

A concern raised in regard to the PCM course is the “subjective” 
nature of the performance assessment, particularly in comparison 
with the “objective” process employed in the basic science courses. 
This study supports the current evaluation approach in PCM that 
includes assessments by small-group facilitators and preceptors and 
GOSCE performance in a fashion that is more consistent with as- 
sessment methods used in the third-year clinical clerkships. 

Overall, we believe that this information has improved the 
faculty’s confidence in their ability to evaluate student 
performance. Previously, there were very few students identified as 
having difficulties prior to the third year, and these students 
rarely were noted to have professional development issues. The 
outcome of this study has already influenced the student assessment 
process in the OHSU School of Medicine. Validation of early concerns 
about student performance provides faculty with greater confidence in 
early identification of students who are at academic risk. Confidence 
in this evaluation system has initiated changes in the medical 
school’s Student Progress Committee’s approach to considering at-risk 
students. A professional development evaluation has been established 
that is used to identify early concerns regarding professionalism or 
concerns related to clinical skills or attitudes despite the fact 
that a student may have successfully passed the courses. Thus, 
students who are succeeding in the basic science curriculum but who 
are struggling with clinical integration or who are not demonstrating 
appropriate professional development may be reviewed by the Student 
Progress Committee. 

Additional factors not considered in the current study, including 
age, gender, ethnicity, and years between matriculation and 
graduation from college, may have also contributed to the variations 
observed in this population. Further analysis is needed to resolve 
the potential influences of any of these additional factors. 

Finally, confidence in identifying students at risk early in the 
curriculum provides the opportunity for remediation at a time that 
is more conducive to improving the long-term success of the stu- 
dent. This creates the need for a process of developing individu- 
alized programs to address the specific shortcomings identified. 
However, an essential aspect of such a process is validation of an 
early academic warning system, as demonstrated in this study based 
on assessment during a longitudinal clinical experience in the first 
two years of medical school. As more medical school curricula now 
include early clinical experiences, the opportunity exists for con- 
firmation of these findings through multi-site studies. 

Correspondence: Dr. Scott A. Fields, Department of Family Medicine, Oregon Health 
Sciences University, 3181 SW Sam Jackson Park Road, Portland, OR 97201; e-mail: 
<safields@ohsu.edu>. 
• THOUGHTS ON THINKING 



Moderator: Glenn Regehr, PhD 



The Under-weighting of Implicitly Generated Diagnoses 

KEVIN W. EVA and LEE R. BROOKS 



Imagine that a diagnostician is asked to comment on a diagnosis 
proposed by a colleague. Clearly, to decide whether or not the di' 
agnosis is probably correct, other diagnostic possibilities for that 
case must be considered. However, the prevalence of confirmation 
biases recorded in the psychology literature suggests that the pro- 
posed diagnosis has some priority over self-generated diagnoses.’ 
Considering the proposed diagnosis first might lead to noticing and 
taking seriously the features consistent with it and evaluating other 
diagnoses in that light. The current study was designed to address 
this issue by examining differences in probability ratings and patient
management decisions as a function of whether diagnostic
alternatives are presented explicitly or are generated by the diag- 
nosticians themselves. Normatively, there should be no difference 
in the probability assigned to, or the patient management decisions 
made on the basis of, a diagnostic alternative regardless of whether 
it was suggested by someone else or was self-generated. In fact, for 
at least some levels of expertise, the source or explicitness of the 
diagnosis might be important in how thoroughly it is considered. 

It is well documented that the probability rating assigned to a 
particular diagnosis tends to be greater when that diagnosis is pre- 
sented in isolation relative to when it is presented within a list of 
alternative diagnoses (the unpacking effect).^ For example, the
rated probability that a person will die of cancer tends to be greater 
when cancer is considered by itself than when presented within a 
list of differential diagnoses. Our previous work has shown, some- 
what counter-intuitively, that the alternative diagnoses that have 
the greatest influence on the probability assigned to a focal diag- 
nosis are those that are most likely to have been considered even 
in the absence of their explicit presentation.’ That is, the magnitude
of the unpacking effect (i.e., the decrease in the probability
assigned to the focal diagnosis) was greater when the unpacked 
alternatives were believed to be highly plausible by independent
experts relative to when the unpacked alternatives were believed 
to be less likely. While this result suggests that participants under- 
appreciated diagnostic alternatives that they themselves generated 
relative to when the same alternatives were explicitly presented, 
the experimental design did not allow us to be certain that participants
had actually considered the diagnoses that were rated as most
likely. Five diagnoses were explicitly presented in the unpacked 
condition, thereby allowing the possibility that participants had not 
generated all of the plausible diagnoses while reading the case his- 
tory. 

The current study was designed to demonstrate the same result
for alternatives that participants claimed to have actually consid- 
ered. Furthermore, we attempted to maximize the probability that 
a specific alternative diagnosis would come to mind even when not 
presented explicitly by using clinical cases previously shown to have 
two highly likely and roughly equiprobable diagnoses.^ Both manipulations
should eliminate the unpacking effect if diagnosticians
evaluate diagnostic possibilities that they themselves generate in 
the same way as diagnoses that are explicitly provided.

While subjective estimates of probability are believed to provide
a valid measure of participants’ clinical decision-making processes,
it is possible that the act of assigning probabilities is a formal exercise
that is not closely related to actual practice. So, the current
study also served as an attempt to demonstrate that the unpacking
effect is not restricted to numerical estimates of probability by
examining whether or not patient management strategies, such as the
requesting of diagnostic tests, are influenced by the explicit presentation
of diagnostic alternatives. That is, if diagnosticians request
more tests upon being presented two highly diagnostic alternatives
relative to being presented just the focal diagnosis, we would have
converging, and perhaps more ecologically valid, evidence that
there is a tendency to under-weight alternatives that are not explicitly
provided. Redelmeier et al.^ have previously shown that the
likelihood that fourth-year medical students will order a CT scan
upon the presentation of a potential case of sinusitis was influenced
by the number of alternative diagnoses that were explicitly mentioned.
The current study attempts to further ensure the robustness
and generalizability of their findings by using multiple cases and a
more extreme manipulation.

We tested this design initially on medical students primarily because
of the ease of obtaining them as participants, but we believe
that this initial step provides data of interest. Numerous biasing
studies in medicine have confirmed that both experts and novices
tend to be susceptible to the same heuristic-induced errors. ’ ” Understanding
the mechanism underlying such processes might allow
insight into the source of any errors that are made even as the
number of errors a diagnostician makes undoubtedly decreases with
the development of expertise.

Method 

Participants. The participant pool for this study consisted of second-year
medical students from McMaster University’s graduating
class of 2001. A sample of tutorial leaders asked their students
whether they would participate. Those who agreed were run
through the experiment in their tutorial groups during two sessions
separated in time by an average of eight days (range 4-H days).
Twenty students participated in four groups, but follow-up data
could not be collected for one of the students, leaving 19 with
complete sets of data. Upon completion of the second group session,
participants were paid $20 and given feedback on both the
clinical cases used and the purpose of the study.

Materials. Participants were presented ten case histories, each of
which was followed by one or two diagnostic hypotheses and a
series of five questions. (1) “Given the case history that you have
just read, please assign a number between 0 and 100 indicating
how likely you think it is that the case history is representative of
the given diagnosis(es).” In all conditions participants were told
that the diagnoses were mutually exclusive and that the inclusion
of an “all other diagnoses” alternative meant that each list was
exhaustive, thereby indicating that the sum of the ratings assigned
should be 100%. (2) “Are there any diagnostic tests that you would
like to see performed to aid you in your decision? If yes, please list
them.” (3) “While reading the case history, did you consider any
diagnosis apart from those listed above? If yes, please state the diagnosis
that you consider to be the most likely differential.” (4)
“Please rate your confidence (on a scale of 1 to 100) that you know
the correct diagnosis.” (5) “Please rate the typicality of this case
on a scale of 1 to 100.” The latter two questions were intended to
serve as dummy variables that would increase the likelihood that
participants would not remember the exact probability assigned to
any particular question.

Procedure. Each of the ten cases has been shown to be suggestive
of two diagnoses, both highly likely and roughly equiprobable.^
One of each pair of diagnoses was randomly selected to
be the focal diagnosis, that is, the diagnostic alternative that would be
presented with its associated case history across all conditions. In
working through all ten cases, each subject was shown five cases
within each condition (i.e., focal diagnosis alone versus focal +
alternative diagnosis) randomly mixed together. Approximately one
week later each participant was shown the same ten cases and asked
to rate the original alternative(s) together with the alternative they
had generated in response to question three. If no alternative had
been generated, participants were simply shown the original alternatives
a second time. Apart from adding the alternatives participants
had generated during the first pass, the questionnaires used
during the two sessions were identical.
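
The first-session assignment described above can be sketched in code. This is a minimal illustration only, assuming a simple shuffle-based randomization; the function name `assign_conditions`, the numeric case labels, and the seeded generator are hypothetical, since the randomization procedure itself is not described in the text.

```python
import random

def assign_conditions(case_ids, rng):
    """Split cases evenly into the two presentation conditions, then
    return them in a randomly mixed presentation order.

    Hypothetical helper: the study reports five cases per condition,
    randomly mixed, but does not describe how this was implemented.
    """
    cases = list(case_ids)
    if len(cases) % 2 != 0:
        raise ValueError("need an even number of cases")
    rng.shuffle(cases)                      # random split into conditions
    half = len(cases) // 2
    plan = [(c, "focal alone") for c in cases[:half]]
    plan += [(c, "focal + alternative") for c in cases[half:]]
    rng.shuffle(plan)                       # randomly mix presentation order
    return plan

rng = random.Random(2000)                   # arbitrary seed for repeatability
plan = assign_conditions(range(1, 11), rng)
for case, condition in plan:
    print(f"case {case:2d}: {condition}")
```

Each call produces one participant's ordering; a fresh shuffle per participant would give each subject a different mix of cases across the two conditions.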

Results 



In completing all ten cases, 190 observations were generated that
could be analyzed for the unpacking effect (a decrease in the probability
assigned to a focal diagnosis upon the explicit presentation
of additional diagnoses). Table 1 presents the average probability
assigned to the focal diagnosis as a function of condition. First, a
2 (session) × 2 (number of alternatives presented during pass 1)
× 2 (diagnostic alternative: generated versus not generated) × 10
(case) ANOVA was performed. A significant effect of “number of
alternatives” (F(1,156) = 12.365, p < .01) revealed that the probability
assigned to the focal diagnosis was higher when presented
in isolation than it was when presented in conjunction with a second
diagnosis even though the alternative diagnosis was the most
likely differential. An effect of “diagnostic alternative” was also
found (F(1,156) = 6.009, p < .02), thereby indicating that participants
rated the focal diagnosis as more likely when they did not
generate a plausible alternative diagnosis, as indicated by their responses
to question 3. Case was the only other effect that reached
significance (F(9,156) = 4.746, p < .01).

To further demonstrate that the unpacking effect occurs even
when the unpacked alternatives are diagnoses that the participants
had already considered, we performed a 2 (session) × 2 (number
of alternatives presented during pass 1) × 10 (case) ANOVA on
only those observations in which a diagnostic alternative had been
generated, that is, using only the data presented in the Alternative
Generated column of Table 1. The main effect of “number of alternatives”
persisted (F(1,139) = 8.861, p < .01). In addition, a
main effect of session was found (F(1,139) = 16.375, p < .01),
which indicates that the probability assigned to the focal diagnosis
was lower in session 2 than in session 1 even though the only
difference between the two sessions was the explicit presentation
during session 2 of the diagnoses that the participants claimed to
have considered implicitly during session 1. Case was, once again,
the only other effect that achieved significance (F(9,139) = 4.323,
p < .01). The effect of session was not observed when the same
analysis was repeated for trials in which the participants did not
generate a diagnostic alternative (i.e., using only the data presented
in the Alternative Not Generated column of Table 1). This indicates
that the effect was not simply a result of the passage of time. The
numbers of observations in these cells were small, but examination
of the means suggests that, if anything, the probability assigned to
the focal diagnosis increased in session 2 relative to session 1 if no
diagnostic alternative had been generated during session 1 (F(1,17)
= 0.017, p > .85).

We also examined whether or not the phenomenon being illustrated
by the probability ratings might influence management strategies
by asking our participants to list the diagnostic tests that they
would be interested in seeing performed. Participants requested
more tests when two diagnoses were presented (mean = 3.464) than
when the focal diagnosis was presented in isolation (mean = 2.989;
F(1,187) = 4.938, p < .05). This result suggests that the explicit
presentation of diagnoses can influence the management strategies
of diagnosticians in addition to altering their rating of another diagnosis’s
likelihood.






Table 1. Mean Probability Ratings (and Counts of Numbers of
Observations) Assigned to the Focal Diagnosis across Condition, by
Second-year Medical Students at McMaster University, 1998-99*

                                                  Focal Diagnosis
                                   Alternative     Alternative
Diagnosis(es) Presented            Generated       Not Generated   Overall

Session 1
  Focal                            44.39 (89)      60.83 (6)       45.43 (95)
  Focal + alternative              33.86 (70)      38.80 (25)      35.16 (95)
  Overall                          39.75 (159)     43.06 (31)      40.29 (190)

Session 2
  Focal (+ generated               37.13 (89)      59.17 (6)       38.53 (95)
    alternative if generated)
  Focal + alternative              30.07 (70)      42.00 (25)      33.21 (95)
    (+ generated alternative
    if generated)
  Overall                          34.03 (159)     45.32 (31)      35.87 (190)

* Each of 19 medical students reviewed ten clinical cases in sessions one week apart.
See text for details.
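
The Overall entries in Table 1 are consistent with count-weighted averages of the cell means. The sketch below recomputes the row totals for each session and the size of the unpacking effect (focal alone minus focal + alternative), using only the values printed in Table 1. The variable and function names are illustrative, the Session 2 row labels are abbreviated, and discrepancies of about 0.01 are to be expected because the published cell means are themselves rounded.

```python
# Cell means and observation counts copied from Table 1:
# {session: {row: {column: (mean rating, number of observations)}}}
TABLE1 = {
    "session 1": {
        "focal": {"generated": (44.39, 89), "not generated": (60.83, 6)},
        "focal + alternative": {"generated": (33.86, 70), "not generated": (38.80, 25)},
    },
    "session 2": {
        "focal": {"generated": (37.13, 89), "not generated": (59.17, 6)},
        "focal + alternative": {"generated": (30.07, 70), "not generated": (42.00, 25)},
    },
}

def weighted_mean(cells):
    """Average of cell means, weighted by the number of observations."""
    total_n = sum(n for _, n in cells)
    return sum(mean * n for mean, n in cells) / total_n

def row_overall(session, row):
    """Overall mean for one row, pooling both 'alternative' columns."""
    return weighted_mean(list(TABLE1[session][row].values()))

for session in TABLE1:
    alone = row_overall(session, "focal")
    both = row_overall(session, "focal + alternative")
    print(f"{session}: focal alone {alone:.2f}, "
          f"focal + alternative {both:.2f}, difference {alone - both:.2f}")
```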



Finally, the effects of the number of alternatives presented on
confidence ratings and typicality ratings were analyzed. No effect
of session or “number of alternatives” was found for either of these
two variables.

Discussion 

These results support the notion that individuals tend to under- 
appreciate self-generated diagnoses relative to diagnoses that are 
explicitly presented. The participants rated the originally presented 
diagnosis as less probable when the alternative they had claimed
to be considering implicitly was provided in a more explicit manner.
That is, the unpacking effect was found even when the diagnostic
alternative that was unpacked was one that our participants
claimed to have considered while originally viewing the case. 

While differences in the probability ratings assigned to the focal
diagnosis across condition might appear small relative to the 100-point
scale used, it is important to note that the functional range
of potential responses was actually substantially smaller than 100.
As mentioned earlier, the cases were originally designed to be indicative
of two diagnoses, which are both highly likely and roughly
equiprobable. Consistent with that manipulation, our participants
were hesitant to assign a very high or a very low likelihood rating
to any individual diagnosis. The effect size across packed versus 
unpacked versions of the questionnaire was 0.46 — a medium-sized 
effect"' — even though substantial effort was invested to ensure that 
the cards were stacked in favor of the null hypothesis. 
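
For readers who want to see how a standardized effect size of this kind is obtained, the following is a minimal sketch. It assumes that the reported value is a standardized mean difference (Cohen's d with a pooled standard deviation), which the text does not state explicitly. The function name and the ratings in the usage example are hypothetical and are not study data.

```python
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample variance."""
    n_a, n_b = len(group_a), len(group_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each group needs at least two observations")
    pooled_var = ((n_a - 1) * variance(group_a) +
                  (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5

# Hypothetical probability ratings (0-100), invented for illustration only.
packed = [50, 40, 60, 45, 55]      # focal diagnosis presented alone
unpacked = [40, 35, 50, 30, 45]    # focal + alternative presented
d = cohens_d(packed, unpacked)
print(f"d = {d:.2f}")
```

By Cohen's widely used conventions, values near 0.2, 0.5, and 0.8 are described as small, medium, and large, respectively.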

That being said, the mechanism that causes individuals to under-weight
alternatives that are not explicitly presented remains in
question. As alluded to in the introduction, the effects observed
might arise as a result of confirmation bias, as the explicit presentation
of a diagnostic alternative might cause diagnosticians to differentially
process the evidence relevant to the diagnostic possibilities.
This could arise in at least two ways that are not necessarily
exclusive of one another: the explicit presentation of a diagnostic
hypothesis might influence both the search for and the construal
of evidence. Support for the plausibility of these hypotheses is widespread.

For example, it has been found that, when given the opportunity
to select additional information (i.e., prevalence data), medical students,’'
residents,*^ and physicians" tend to seek data that are relevant
to a single disease while ignoring information that is related
to equally plausible differential diagnoses. This biased search for
information need not be proactive in that it does not necessarily
take place while the diagnostician gathers novel information. On
the contrary, Anderson and Pichert have shown that retrieval of
information from memory is also influenced by the context within
which the search takes place.* ^ When asked to recall information
about a house, the type of information participants were able to
remember was dependent on whether they had been asked to read
the story from the perspective of a burglar or a home buyer. When
subjects were later asked to adopt the opposite perspective, they
were able to recall more information that simply had not been
available during the first memory task. A plausible extension of this
result is that the explicit presentation of a diagnosis might bias the
memorial retrieval of features present in the case history.

Furthermore, maintaining an initial focus on the diagnosis that
is explicitly presented might make it difficult to realize that non-discriminating
symptoms provide evidence for more than one diagnostic
alternative. For example, considering the nausea and vomiting
with which an 18-year-old woman with right-lower-quadrant
discomfort presents as indicative of appendicitis might blind an
individual to the possibility that these symptoms can also be construed
as clinical manifestations of pelvic inflammatory disease.
Norman, Le Blanc, and Brooks have provided evidence that supports
this notion by reporting that the mere presentation of a diagnostic
alternative can influence the interpretation of classic clinical
features.’ Reinterpreting these features in light of self-generated
diagnoses could prove to be difficult.

Regardless of their cause, the data presented here indicate that 
the meaning of the verb “to consider” should not necessarily be 
taken at face value. Having considered the plausibility of a diag- 
nostic alternative can mean anything from having had the term 
come to mind to having performed a comprehensive analysis of the 
evidence for and against that particular diagnosis. Asking our par- 
ticipants to assign a probability rating to the likelihood of diagnoses 
that they claim to have considered was sufficient to decrease the 
probability that they were willing to assign to the focal diagnosis. 
This strongly suggests that the evidence in favor of the self-gen- 
erated alternative was underappreciated relative to when attention 
was focused on that alternative explicitly. Further research is required
to determine whether or not particular strategies can be
adopted to prevent such under-weighting.

• THOUGHTS ON THINKING



Moderator: Glenn Regehr, PhD



The Impact of Structured Student Debates on Critical Thinking and Informatics Skills
of Second-year Medical Students

STEVEN A. LIEBERMAN, JULIE M. TRUMBLE, and EDWARD R. SMITH 



Yet it has become increasingly difficult to keep abreast of and to assimilate
the investigative reports which accumulate day after day.
. . . (O)ne suffocates . . . through exposure to the massive body of
rapidly growing information.

    Bernhard von Langenbeck, Address at the First Congress of
    Surgery, April 10, 1872*



Among its many facets, the field of medical informatics encompasses
the use of technology to access and manage scientific information.
The Association of American Medical Colleges (AAMC),
through the Medical Informatics Objectives of the Medical School
Objectives Project (MSOP), has identified five informatics-related
roles of the physician and has established objectives for each of
these roles. The lifelong learning role incorporates skills relating to
information retrieval, evaluation, and reconciliation. Without activities
specifically targeting these skills, it is an act of faith that
students will graduate with adequate preparation in these areas.

To explicitly address these curricular goals, second-year students
in the Endocrinology and Reproduction course at the University of
Texas Medical Branch in Galveston were required to participate in
debates on controversial topics in these fields. This exercise provided
a structured task for developing and improving skills in literature
searching, critical thinking, including evaluation of the
quality of studies, reconciling results of conflicting studies, teamwork,
formal presentation and communication, and spontaneous
scholarly discussion.

A search of the Medline database produced only one article de- 
scribing the use of student debates for acquiring content and de- 
veloping critical thinking and communication skills in health sci- 
ence education.' The paper describes a first-year pharmacy 
curriculum that incorporated debates on socioeconomic topics rel- 
evant to pharmacy practice. While these debates required critical 
analysis of issues, the primary focus was on content rather than
cognitive or informatics-related skills.* 

Published accounts of debates in a college chemistry course' and
business school provide qualitative descriptions of the beneficial
effects of such exercises on critical thinking, updating knowledge,
and communication skills. In a more quantitative approach, Allen
et al.’’ conducted a meta-analysis of the impact of formal instruction
in communication skills (including debates) on critical thinking
ability. Such training resulted in a 44% increase in scores on tests of
critical thinking. Compared with training in other communication
skills, participation in “forensics” (i.e., competitive debates) produced
the greatest improvement, although the differences did not
achieve statistical significance.^ Finally, Johnson et al.^ performed
a meta-analysis of the effects of a method they call “academic controversy”
on a variety of cognitive outcomes. This method, which
shares many features of the debates described in the current report,
has produced “increased achievement and retention, higher-quality
problem-solving and decision-making, more frequent creative insight,
more thorough exchange of expertise, and greater task involvement”
by students.^’

The current report describes the implementation of structured
debates and the evaluation by students and faculty of the degree
to which the informatics objectives were accomplished.



Method 

The 174 second-year students were divided into six sections of approximately
30 students each and were assigned to teams of three
within each section. The debate topics represented areas of controversy
in endocrinology and reproductive science. Students received
assigned topics at the beginning of the course, and each student
participated in one debate. When not presenting, students were
expected to attend their section’s debates. Each team researched
background information, identified the main issues, found and analyzed
relevant studies, developed arguments on both sides of the
topic, developed a “rational compromise” after weighing the evidence,
and prepared to present each side of the topic.

Each team was assigned to present the pro (supporting) or con
(opposing) perspective immediately before the debate. Each of the
six students gave a five-minute presentation of one of the following
segments: pro background and arguments; pro supporting data; con
background and arguments; con supporting data; pro “rational compromise”;
con “rational compromise.” All team members participated
in a ten-minute rebuttal segment prior to the “rational compromises”
and a ten-minute question-and-answer session after the
final presentation. A faculty moderator kept the session on schedule,
participated in the question-and-answer portion, and evaluated
the students’ performances.

The students were assessed individually on presentation skills, 
contributions to the rebuttal and question-and-answer segments, 
and professionalism. The teams were evaluated for the accuracy of 
information and appropriateness of conclusions, and on written 
summaries and references turned in at the debate. Individual and 
team scores were combined to generate a letter grade for each stu- 
dent. 

Debates presented in each of the six student sections in a given
week generally revolved around a single theme in order to provide
similar learning experiences for all students attending. For example,
one set of debates addressed related facets of the role of insulin
resistance in producing disease: (1) Hyperinsulinemia causes hypertension;
(2) Insulin resistance increases the risk of thrombosis;
(3) Insulin resistance increases the risk of coronary artery disease;
(4) Insulin resistance causes the polycystic ovary syndrome; (5)
Obesity is an independent risk factor for coronary artery disease;
(6) Intensive treatment of type 2 diabetes mellitus lowers the risk
of coronary artery disease compared with conventional treatment.
Other themes included menopausal hormone replacement therapy,
HIV infection and pregnancy, growth hormone therapy in non-growth-hormone-deficient
children, and the diagnosis and management
of thyroid and parathyroid neoplasia.

The effectiveness of this exercise was evaluated by three modalities.
First, all students (n = 174) were requested to complete a
survey following their debates. Second, faculty moderators (n = 17)
were surveyed to obtain their impressions of the students’ skills and
the educational value of the debates. Finally, two focus groups of
randomly selected students (n = 4 per group) met with facilitators
midway through the course to discuss the debates and other course
aspects. The facilitators were educators not directly involved in the
course. Summaries and anonymous comments from the focus groups
were reviewed and approved by the students.

The student and faculty questionnaires were parallel instruments

Table 1. Student Self-assessments and Faculty Ratings of Skills Developed during Preparation and Presentation of Structured Student Debates,
University of Texas Medical Branch at Galveston, 1999-2000*

                                                            Student Self-assessment
                                                                                          Percentage of Students
                                                                                          Improving by:
Skill                                                       Before        After           1 Level   ≥2 Levels   Faculty Rating

A. Weighing conflicting information from multiple sources   3.23 (0.13)   4.31 (0.10)†    32.5%     30.8%       3.80 (0.49)
   and reconciling the differences (MSOP Informatics
   Objective A.3.c)
B. Critically reviewing published research                  3.09 (0.12)   3.97 (0.08)†    34.2%     23.9%       3.00 (0.45)
   (MSOP Informatics Objective A.3.d)
C. Discriminating between types of information sources      3.23 (0.12)   4.15 (0.08)†    29.9%     26.5%       3.80 (0.37)
   in terms of currency, format, authority, relevance,
   and availability (MSOP Informatics Objective A.3.b)
D. Recognizing factors that influence the                   3.26 (0.12)   4.17 (0.08)†    29.1%     25.6%       3.40 (0.40)
   accuracy/validity of information (MSOP Informatics
   Objective A.3.a)
E. Making evidence-based decisions                          3.62 (0.12)   4.39 (0.08)†    30.8%     20.5%       3.20 (0.49)
   (MSOP Informatics Objective A.4.c)
F. Expressing the relative risks and benefits of            3.41 (0.12)   4.22 (0.11)†    28.2%     22.2%       3.83 (0.17)
   outcomes/treatment options (MSOP Informatics
   Objective B.5.b)
G. Medline searching                                        3.72 (0.13)   4.65 (0.08)†    18.8%     28.2%       4.50 (0.22)
   (MSOP Informatics Objective A.2.a & b)
H. Knowledge of cost-benefit issues in health care          2.94 (0.12)   3.67 (0.11)†    28.2%     18.8%       3.33 (0.33)
   (MSOP Informatics Objective E.2.Q)
I. Ability to make formal presentations                     3.75 (0.12)   4.35 (0.08)†    29.9%     14.5%       5.00 (0.00)
   (MSOP Informatics Objective C.2)
J. Maintaining a healthy skepticism about the quality       3.73 (0.12)   4.42 (0.09)†    20.5%     20.5%       3.17 (0.54)
   of information (MSOP Informatics Objective A.4.b)
K. Using multiple sources for problem solving               3.93 (0.11)   4.49 (0.08)†    28.2%     12.8%       4.17 (0.31)
   (MSOP Informatics Objective A.4.a)
L. Impromptu reasoning skills (the ability to “think        3.82 (0.11)   4.15 (0.09)†    23.9%     5.1%        4.17 (0.31)
   on your feet”)
M. Working effectively as a team to accomplish tasks        4.57 (0.09)   4.91 (0.08)†    23.9%     4.3%        5.33 (0.21)

*Students were asked to indicate on a scale of 1 through 6 (1 = complete novice; 2 = minimally competent; 4 = moderately competent; 6 = expert) their “skill level on each of
the following both BEFORE and AFTER the debate.” 114 of 174 students (65.5%) completed the survey. Data are means ± SEM.
†p < 0.0001 by paired t test for comparisons of “before” versus “after.”



and were administered following the debates. Section A asked the
students to use a three-item scale (“major resource,” “minor resource,”
and “not used”) to describe the importance of ten resource
types. Faculty had one additional category, “can’t judge.”
Section B addressed 13 specific objectives of the debates, 11 of
which corresponded to skills identified in the MSOP (Table 1).
Students used a scale from 0 to 6 (0 = not used/not applicable; 1
= complete novice; 2 = minimally competent; 4 = moderately competent;
6 = expert) to retrospectively rate their pre- and post-debate
skills. The faculty scale replaced “not used” with “can’t judge.”
Section C used Likert-like scales to assess the importance of skills
fostered by the debates, and the usefulness and timing of debates
for promoting skill development. Finally, section D asked for comments
and suggestions.

Results 

Of the 174 participants, 114 (65.5%) responded to the survey. They
did not differ from the non-respondents with regard to individual
debate scores (33.6 ± 0.2 versus 33.6 ± 0.3, p > .9), team debate
scores (29.2 ± 0.3 versus 29.0 ± 0.4, p > .7), or scores on the final
course exam (90.3 ± 0.8 versus 88.3 ± 1.4, p > .2). Six faculty,
three clinicians and three basic scientists, who had moderated 19
of the 30 debates (63.3%), responded to the faculty survey. These
six included all four who had moderated more than one debate.

Among the students responding, 78 (67%) indicated that the
skills acquired through the debates would be “important” or “very
important” in their careers, while all six faculty rated the importance
of these skills in the highest category. Seven students (6%)
felt the skills would be “not important at all.” Seventy students
(60%) agreed or strongly agreed that the debate had been a valuable
learning exercise, while 23 (20%) disagreed or strongly disagreed.
The students were evenly divided as to whether one (n =
33), two (n = 35), or three-to-four (n = 33) similar exercises would
be required to “promote adequate development” of the skills. Four
faculty (66.7%) felt three or four times would be appropriate, one
felt four to eight times would be needed, and one felt two times
would be adequate. One faculty member and 61 students (52%)
felt the preclinical years were the most appropriate place in the
curriculum for such exercises. Seventeen students (15%) felt they
should be limited to the clinical years, and five faculty (83%) and
23 students (20%) indicated that the exercises should occur
throughout the four-year curriculum.

The results of the student and faculty surveys of skill develop- 
ment are presented in Table 1. The students’ self-assessments in- 
creased significantly for all skills, with mean ratings of post-debate 
skills generally near a score of 4, or “moderately competent.” How- 
ever, the increase in mean score was greater than one level for only 
one skill (weighing and reconciling conflicting information), while 
for two skills (impromptu reasoning; working effectively as a team) 
less than 40% of the respondents reported any increase. Although 
the sample sizes (114 students, 6 faculty) preclude statistical com- 
parisons, faculty ratings of student skills appeared lower than stu- 
dent self-ratings for all but four skills. 
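
The two summary statements about the student self-assessments can be checked directly against the values in Table 1. The sketch below is illustrative only: the skill letters and numbers are as read from Table 1, while the names `SKILLS`, `mean_gain`, and `pct_any_increase` are hypothetical. It treats the percentage reporting any increase as the sum of the two "improving by" columns, which is an assumption about how those columns were defined.

```python
# (before mean, after mean, % improving 1 level, % improving >=2 levels),
# as read from the student self-assessment columns of Table 1.
SKILLS = {
    "A": (3.23, 4.31, 32.5, 30.8),
    "B": (3.09, 3.97, 34.2, 23.9),
    "C": (3.23, 4.15, 29.9, 26.5),
    "D": (3.26, 4.17, 29.1, 25.6),
    "E": (3.62, 4.39, 30.8, 20.5),
    "F": (3.41, 4.22, 28.2, 22.2),
    "G": (3.72, 4.65, 18.8, 28.2),
    "H": (2.94, 3.67, 28.2, 18.8),
    "I": (3.75, 4.35, 29.9, 14.5),
    "J": (3.73, 4.42, 20.5, 20.5),
    "K": (3.93, 4.49, 28.2, 12.8),
    "L": (3.82, 4.15, 23.9, 5.1),
    "M": (4.57, 4.91, 23.9, 4.3),
}

def mean_gain(skill):
    """Difference between the post- and pre-debate mean ratings."""
    before, after, _, _ = SKILLS[skill]
    return round(after - before, 2)

def pct_any_increase(skill):
    """Percentage of respondents reporting an increase of any size."""
    _, _, one_level, two_plus = SKILLS[skill]
    return round(one_level + two_plus, 1)

gain_over_one = [s for s in SKILLS if mean_gain(s) > 1.0]
under_40_pct = [s for s in SKILLS if pct_any_increase(s) < 40.0]
print("mean gain > 1 level:", gain_over_one)
print("fewer than 40% improved:", under_40_pct)
```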

Table 2 summarizes the students’ responses regarding resource 
utilization. Review articles (88.9%) and primary research articles 
(86.3%) were most frequently identified as major resources. 

Focus-group summaries corroborated the generally favorable sur- 
vey findings. Specifically, the debates w-ere perceived more as ex- 
ercises in critical thinking than as exercises in content acquisition, 
had been effective in promoting literature -searching and research- 
analysis skills, and had been “interesting and enjoyable.” Tlie most 
common criticism was the amount of preparation time required. 
Comments from the faculty survey, while overall extremely favorable,
suggested several areas for improvement: students’ overreliance on
reviews and published expert opinion, a tendency for students to
want “to win” the debate rather than come to a balanced
judgment based on the evidence, and the need to couple specific 
instruction in these skills with the debates. 

Table 2. Numbers and Percentages of Second-year Medical Students
Rating Information Resources as Major or Minor in Preparing for
Structured Debates, University of Texas Medical Branch
in Galveston, 1999-2000*

                                                       Ratings
                                        Major Resource   Minor Resource      Not Used
Resource                                  No.      %       No.      %       No.      %
Review articles                           104    88.9       12    10.3        1     0.9
Primary research articles                 101    86.3       15    12.8        1     0.9
Systematic reviews/meta-analyses           52    44.4       46    39.3       19    16.2
Practice guidelines/consensus statements   28    23.9       48    41.0       41    35.0
Other textbooks                            23    19.7       66    56.4       28    23.9
Required course textbook                   15    12.8       80    68.4       22    18.8
Professional Internet sites                13    11.1       40    34.2       64    54.7
Consultation with an expert                 6     5.1       53    45.3       58    49.6
Government Internet sites                   5     4.3       21    17.9       91    77.8
Commercial Internet sites                   4     3.4       34    29.1       79    67.5

*Students were asked “Please indicate the importance of each of the following
resources in preparing for your debate.” Of 174 students participating in the debates,
114 (65.5%) responded.



Discussion 

This report describes the method and evaluation of structured student
debates for promoting the development of several cognitive and
informatics-related skills, many of which are embodied in the MSOP.
The data reported confirm that this exercise accomplished most of its
goals. The central goal of promoting the development of skills in
analyzing research studies and weighing and reconciling contrasting
results was realized. The specific objectives related to this goal
(A-D in Table 1) showed the greatest mean increases in self-ratings as
well as the greatest proportions of students reporting improvement.
Furthermore, primary research articles were among the two most
important resource categories, corroborating the value of this
exercise in stimulating critical analysis of research reports.
However, the comparable emphasis on review articles raises concern
that the exercise could deteriorate into general summaries rather than
critical evaluations of the literature. In order to focus students’
attention on the primary literature, the debate format and evaluation
emphasized the use of data to support arguments. Some reliance on
review articles was to be expected, as most students had neither
extensive backgrounds in the topics nor much experience in reconciling
conflicting research. Faculty impressions were lower than the
students’ self-ratings in these areas, especially with regard to
“critically reviewing published research,” with mean ratings below the
“moderately competent” level. Thus, at the completion of this
exercise, the faculty perceived lower abilities and, therefore, a
greater need for further skill development than did the students.

The students also indicated improvement in literature searching, 
weighing risks and benefits of treatments, making evidence-based 
decisions, and understanding cost-benefit issues. For the other self- 
rated skills there were lower proportions of students improving and 
smaller increases in mean scores, although all increases were statis- 
tically significant. Although faculty assessments of most skills were 
lower than students’, faculty rated the students at comparable or
higher levels in literature searching, presentation skills, impromptu
reasoning, and teamwork.

The retrospective nature of the student survey, in which the 
students rated both their pre- and post-debate skills after complet- 
ing the debate, may be viewed as a weakness in the study design. 
Nonetheless, the increase in scores indicates that the students felt 
the exercise did, in fact, promote skill development. While the 
significant increases in mean scores indicate progress in students’ 
skill development, the magnitudes of changes were generally small 
and the percentages of students reporting improvement varied by 
skill. These findings suggest that one such exercise is insufficient 
for adequate skill development. All faculty and most students
acknowledged the importance of these skills and indicated that
additional exercises were necessary. Consensus among faculty was for
three or four exercises throughout the four-year curriculum, while 
the students’ varied recommendations are best summarized as two 
debates during the preclinical years. 

In summary, we have found that structured student debates 
among second-year medical students promoted development of crit- 
ical thinking and informatics skills identified in the MSOP Medical 
Informatics Objectives. A series of exercises distributed throughout 
the curriculum, targeting progressively more advanced skills and 
coupled to instruction in these skills, may achieve these objectives 
more fully. 

Correspondence and requests for reprints: Steven A. Lieberman, MD, Department of
Internal Medicine, University of Texas Medical Branch, 301 University Blvd., MRB
8.138, Galveston, TX 77555-1060.



References 

1. Strauss MB. Familiar Quotations. Boston: Little, Brown and Company, 1968.

2. Poirier S. Active involvement of students in the learning process of the American
healthcare system. Am J Pharmaceutical Educ. 1997;61:91-7.

3. Streitberger HE. A method for teaching science, technology, and societal issues in
introductory high school and college chemistry classes. J Chem Educ. 1988;65:
60-1.

4. Schroeder H, Ebert DG. Debates as a business and society teaching technique. J
Business Educ. 1983;58:266-9.

5. Allen M, Berkowitz S, Hunt S, Louden A. A meta-analysis of the impact of forensics
and communication education on critical thinking. Communication Educ.
1999;48:18-30.

6. Johnson DW, Johnson RT, Smith KA. Academic Controversy: Enriching College
Instruction Through Intellectual Conflict. ASHE-ERIC Higher Education Report.
3rd ed. Washington, DC: The George Washington University Graduate School of
Education and Human Development, 1996:123.







• THOUGHTS ON THINKING 



Moderator: Glenn Regehr, PhD



Critical Appraisal Turkey Shoot: Linking Critical Appraisal to Clinical Decision Making 

ALAN J. NEVILLE, HAROLD I. REITER, KEVIN W. EVA, and GEOFFREY R. NORMAN



Since the publication of Physicians for the Twenty-First Century
(“the GPEP Report”) in 1984, medical educators have identified
the need for physicians to become lifelong learners.* Part of the
impetus for this conclusion arises from several studies that have
demonstrated that knowledge and/or competence of physicians decline
as a function of time since graduation; the evidence indicates
the cause to be failure to acquire new knowledge rather than a
tendency to forget previously learned material.” Thus, physicians
need to be trained to identify the relevant medical literature (i.e.,
information-seeking skills) and to apply “critical appraisal”
techniques to analyze potentially useful articles culled from the
literature search.

There is little published evidence that educational interventions
around critical appraisal teaching in undergraduate or postgraduate 
medical curricula impact in a sustained way the knowledge of epi- 
demiologic principles or the critical application of current research 
information for clinical decision making.' In considering the im- 
pact on conceptual knowledge, one could argue that there is a lack 
of validated tools available for evaluating critical appraisal skills; 
alternatively, the format of instruction, timing in the curriculum, 
and duration of instruction may be at fault. More important, studies 
have not addressed the issue of whether the demonstration of mas- 
tery of particular critical appraisal skills can be related to clinical 
decision making. Ultimately, such mastery becomes largely irrelevant
if it does not translate into better judgment.

The authors of this study were concerned that, despite the in- 
clusion in the first-year undergraduate curriculum of several focused 
objectives surrounding critical appraisal in the domain of clinical 
epidemiology, feedback from clinical faculty suggested that students 
had only rudimentary knowledge of the application of these principles
at the end of the first year. In contrast to this feedback,
problem-based learning (PBL) is believed to hold the potential to 
equip graduates with the skills to learn after graduation. In fact, 
several studies have shown significant differences between students 
of PBL and students of conventional curricula in the use of recently 
published medical literature.”*'^ With this inconsistency in mind, 
two experimental questions were asked. 

1. Are critical appraisal concepts to which students are “exposed”
in PBL in earlier curricular blocks retained sufficiently to
allow identification of methodologic errors in formal articles?

2. Does awareness of such methodologic flaws transfer to an ap- 
preciation of how these errors might invalidate the conclusions of 
the journal articles’ authors? 

Ergo, the goal of this study was to investigate the relationship 
between understanding the concepts of critical appraisal and their 
application in clinical decision making. Understanding this rela- 
tionship can potentially improve the teaching of critical appraisal 
and the evaluation of this teaching. 

Methods 

Participants. This was a single-blinded experimental design study.
The participant pool was composed of two consecutive first-year
undergraduate medical school classes (the graduating classes of
1999 and 2000, respectively) in a PBL curriculum at McMaster
University. Each class was composed of 18 tutorial groups of five
to six students each. The students had some background in critical
appraisal, as it had been studied in a readily identifiable manner
during the first curricular unit at the beginning of the first academic
year. For each class, the study took place during the third curricular
unit running during the final three months of their first academic
year.

Materials. The subunit planners for each month-long subunit in
that third curricular unit selected two journal articles from their
respective expert domains of gastroenterology, hematology, and
endocrinology. These context experts chose articles that met the
defined criteria of being (a) methodologically sound and (b) not
directly covered within the context of the unit’s curricular problems.
Within each of the six articles so identified, one, two, or three
different methodologic flaws were implanted, each flaw sufficiently
egregious to warrant dismissal of the author’s conclusions. The
methodologic flaws inserted related to concepts that students were
expected to have come across previously in the curriculum. Six
categories of errors were examined (participant assembly,
randomization, contrast, follow-up, analysis, and other). For example,
the study group may have been inappropriately pooled or randomization
might have been non-blinded. The text of the journal articles
was retyped with the titles, tables, authors, and journal names
absent. After this was done, the original six “gold” articles and their
mirror flawed counterparts, or “turkey” articles, were superficially
indistinguishable.

For each of the six articles a related clinical scenario was
generated that would present a clinical management problem for which
a specific intervention was to be considered. Each problem was
relevant to the unit of study but was not directly related to the
health care problems in the curriculum and could not be answered
using standard textbooks. Also, according to the subunit planners, 
the answers to the problems should have been obvious if the rel- 
evant recent literature was known. 

Procedure. Within both the class of 1999 and the class of 2000,
students were randomly allocated every two weeks to receive either a
gold or a turkey article, for a total of six articles over 12 weeks.
Randomization took place across the entire class, not by tutorial
group, since the students worked on the exercise independently, and
assignment was by use of a table of random numbers. The students
were all given a “pre-appraisal” response sheet with the appropriate
clinical scenario and were asked to respond on an anchored seven-point
Likert-type scale whether they agreed or disagreed with the
optional management or intervention suggested. The scale was
anchored between “definitely yes” (1), “probably yes” (2), “probably
no” (5) and “definitely no” (7). This pre-appraisal response sheet
served as a baseline of the students’ knowledge of the condition
demonstrated by the scenario. The students were then given two
weeks to work on the articles they had been assigned. Afterward,
the students completed a “post-appraisal” sheet that presented the
same clinical scenario and the same clinical question that they had
seen two weeks earlier. In addition, they were asked to identify any
methodologic flaws in the articles they had read. For the class of
1999, this identification took place using an open format. For the
class of 2000, the identification of flaws was noted by ticking them
off a checklist that contained 29 potential methodologic errors,
three to six per category. Response to the post-appraisal
questionnaire would allow us to estimate the students’ ability to
detect methodologic flaws and to assess whether or not the author’s
conclusions had influenced their clinical decisions. The responses were
handed in to the tutor and a “tutor-guide” was provided to briefly
explain the inserted flaws, thereby allowing discussion of the
critical appraisal issues during tutorials.
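
The allocation step described above can be illustrated with a minimal sketch. It assumes a hypothetical roster of student identifiers and substitutes Python's pseudorandom generator for the printed table of random numbers named in the text; the function name `allocate_arms`, the roster labels, and the seed are illustrative only. Simple (unrestricted) randomization is also an assumption, since the text does not say whether the arms were balanced.

```python
import random


def allocate_arms(student_ids, n_articles=6, seed=None):
    """Assign each student to 'gold' or 'turkey' independently for each article.

    Allocation is done across the whole class, not by tutorial group.
    Unrestricted randomization is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    return {
        sid: [rng.choice(("gold", "turkey")) for _ in range(n_articles)]
        for sid in student_ids
    }


# Hypothetical roster of 100 students, matching the stated class size.
roster = [f"student_{i:03d}" for i in range(1, 101)]
allocation = allocate_arms(roster, seed=2000)
```

Fixing the seed makes the allocation reproducible, which plays the role of recording which row of the random number table was used.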

Results 

Eighty-nine of the 100 students in the class of 1999 completed both
the pre- and the post-appraisal questionnaires for at least one of
the six questions. The average number of completed questions per
participant was 5.61, with 69 of the 89 students completing all six
questions. In the class of 2000, 63 of the 100 students completed
both pre- and post-appraisal questionnaires at least once, averaging
5.68 questions per participant, with 50 of the 63 students
completing all six questions. The decreased participation by students
in the second year reflected ambivalence on the part of some of
the tutors in dealing with the logistics of the exercise. Two hundred
and forty-six (49.3%) of the 499 observations collected from the
class of 1999 and 186 (52.0%) of the 358 observations collected
from the class of 2000 were from the gold arm of the studies,
thereby indicating that the questions were not completed
differentially for the two types of papers provided.

Table 1 presents the mean pre-test and post-test scores for both
the turkey and the gold groups of both classes. Upon coding the
data, some scales were reversed so that the low end of the seven-point
Likert scale was always the “correct” response. In neither class
did the pre-test scores of the two groups differ significantly from
one another. A 2 (time: pre- vs. post-) × 2 (arm: gold vs. turkey)
repeated-measures analysis of variance revealed a significant
interaction between time and arm (F(1, 497) = 7.043, p < .01) for the
class of 1999. The same analysis revealed an effect that bordered
on significance for the class of 2000 (F(1, 356) = 3.273, p < .075).
Planned comparison t-tests for both classes revealed the nature of
these interactions. Mean post-test scores of both gold groups
decreased significantly relative to their pre-scores (t[245] = 5.198, p
< .01 and t[185] = 4.834, p < .01 for the class of 1999 and the
class of 2000, respectively). In contrast, mean post-test scores of
both turkey groups did not reveal a significant effect of time (t[252]
= 1.323, p > .18 and t[171] = 1.693, p > .09 for the class of 1999
and the class of 2000, respectively). Therefore, students were more
likely to change their management decisions in an appropriate
direction if they had read a methodologically error-free version of the
paper.
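
The planned comparisons reported above contrast pre-test and post-test scores within an arm, and their degrees of freedom (one fewer than the number of observations in the arm) are consistent with a paired t-test. The following is a minimal sketch of that calculation; the function name `paired_t` is illustrative, and the ratings are made up for the example and are not the study data.

```python
import math
from statistics import mean, stdev


def paired_t(pre, post):
    """Return (t, df) for a paired t-test on the pre-minus-post differences."""
    if len(pre) != len(post):
        raise ValueError("pre and post must have the same length")
    diffs = [a - b for a, b in zip(pre, post)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)  # SE of the mean difference
    return mean(diffs) / standard_error, n - 1


# Made-up seven-point ratings (lower = more correct); not the study data.
pre = [4, 5, 3, 4, 6, 4, 5, 3]
post = [3, 4, 3, 2, 5, 4, 3, 3]
t_stat, df = paired_t(pre, post)
```

A positive t here means scores fell from pre-test to post-test, which on the recoded scale corresponds to movement toward the correct decision.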

The participants who read the error-free gold version of the
article did report having found errors, as can also be observed in Table
1, but they reported having found significantly fewer errors than
those who read the turkey version of the article (t[496] = -3.252,
p < .01 and t[357] = -3.338, p < .01, for the class of 1999 and
the class of 2000, respectively). Collapsing across arms, there was
a significant positive relationship in both classes between the
number of problems raised and the post-score assigned (r = 0.230, p <
.01 for the class of 1999, r = 0.344, p < .01 for the class of 2000).
This indicates that the fewer errors raised, the lower (i.e., more
correct) the post-score that was assigned. This relationship
remained significant when the analysis was limited to the correct
identification of the errors that had been planted within the turkey
articles (r = 0.163, p < .01 and r = 0.251, p < .01 for the classes
of 1999 and 2000, respectively). These analyses provide converging
evidence that students were altering their management decisions
based on the strength of the method that they perceived. In
addition, it is reassuring that the participants did not appear to allow
their prior impressions of the appropriate management decisions to
influence their critical appraisals of the articles presented. This is
evidenced by the lack of a relationship between the number of
problems raised and the pre-score assigned (r = 0.016, p > .72 and
r = 0.068, p > .19 for the classes of 1999 and 2000, respectively).
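
The relationships above are reported as correlation coefficients between the number of problems a student raised and the score that student assigned. As a minimal sketch, assuming these are Pearson product-moment correlations (the text does not name the coefficient), the calculation looks like this; the function name and the data are illustrative and are not taken from the study.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented example: errors raised per reading and the post-score assigned.
errors_raised = [0, 1, 2, 3, 4, 5]
post_scores = [2, 2, 3, 3, 5, 4]
r = pearson_r(errors_raised, post_scores)
```

A positive coefficient on the recoded scale means that readers who raised more problems tended to assign higher (less favorable) scores, which is the direction of the relationship described in the text.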

Table 1. Mean Responses to Patient Management Problems by Class and
Type of Article,* McMaster University, 1997 and 1998

                                                      No. of Errors
Class    Arm       Pre-test Score   Post-test Score   Identified
1999     Gold          3.764            3.195            2.398
         Turkey        3.644            3.506            2.805
2000     Gold          3.460            2.929            2.355
         Turkey        3.496            3.293            3.255

*Study conducted on two consecutive classes of first-year students. For each of the
classes, mean pre-test (before reading the articles) and post-test scores (seven-point
scale) reflecting agreement with the articles’ conclusions are given. Gold arm = students
allocated the original articles; turkey arm = students allocated articles with methodologic
flaws inserted.

Finally, taking into account the numbers of turkey articles read
and the numbers of errors embedded, the potential numbers of errors
that could be correctly identified were 505 and 343 for the
classes of 1999 and 2000, respectively; 178 (35.2%) of them were
identified by the class of 1999 and 80 (23.2%) by the class of 2000.
Review of the actual methodologic flaws identified by the students
demonstrated no consistent pattern between the two classes. The
proportions of the six individual error categories correctly identified
by the class of 1999 were 33/86 (38%) for participant assembly,
37/98 (38%) for randomization, 63/163 (39%) for contrast, 11/45
(24%) for follow-up, 8/68 (12%) for analysis, and 26/45 (58%) for
other. The corresponding proportions correctly identified by the
class of 2000 were 13/56 (23%), 21/56 (38%), 26/113 (23%),
16/29 (55%), 4/59 (7%), and 0/30 (0%) for the same six categories,
respectively.
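
The detection rates reported above can be recomputed from the per-category counts given in the text. The sketch below copies those counts and sums them; the variable and function names are illustrative. The category counts add up to the reported totals (178 of 505 and 80 of 343), and the recomputed overall rate for the class of 2000 comes to about 23.3%, marginally different from the printed 23.2%.

```python
CATEGORIES = (
    "participant assembly", "randomization", "contrast",
    "follow-up", "analysis", "other",
)

# (errors correctly identified, errors that could have been identified),
# per category, copied from the text.
CLASS_1999 = [(33, 86), (37, 98), (63, 163), (11, 45), (8, 68), (26, 45)]
CLASS_2000 = [(13, 56), (21, 56), (26, 113), (16, 29), (4, 59), (0, 30)]


def detection_summary(counts):
    """Return (found, possible, overall %, {category: %}) for one class."""
    found = sum(f for f, _ in counts)
    possible = sum(p for _, p in counts)
    per_category = {
        name: 100 * f / p for name, (f, p) in zip(CATEGORIES, counts)
    }
    return found, possible, 100 * found / possible, per_category


for label, counts in (("1999", CLASS_1999), ("2000", CLASS_2000)):
    found, possible, overall, per_category = detection_summary(counts)
    print(f"Class of {label}: {found}/{possible} = {overall:.1f}%")
    for name, pct in per_category.items():
        print(f"  {name}: {pct:.0f}%")
```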

Discussion 

An ultimate objective in teaching critical appraisal concepts is for 
medical students to view literature searching and critical appraisal 
as fundamental skills required for effective medical practice. As 
Norman et al. demonstrated in a recent review of teaching critical 
appraisal, most reported teaching interventions, even the few
controlled studies published, have assessed short-term gains in
acquiring knowledge of critical appraisal techniques rather than their
application to clinical decision making.^ The results of these studies
were largely consistent with the anecdotal feedback that we have
received from tutors: students appear to be poor critical appraisers.
While it is important to be able to demonstrate some knowledge 
of the principles of how to scrutinize the medical literature carefully 
and critically, some demonstration of putting these principles into 
practice would seem to be just as desirable an educational outcome. 
By using a more decision-oriented outcome measure, the current 
findings suggest that the studies reviewed by Norman et al. and the 
interactions between students and tutors might underestimate stu- 
dents’ ability to critically appraise scientific articles. 

This study demonstrated that first-year medical students can alter 
their clinical management decisions appropriately as a function of 
whether they have read a methodologically sound or flawed journal 
article. When provided with the “gold” journal articles, these
students changed their clinical decisions in the post-test in the
direction of the correct management decisions, despite apparently
identifying some putative methodologic flaws in these “gold” papers. As
expected, however, fewer errors were identified by students in the
“gold” articles, and there was a significant positive relationship
between identifying fewer errors and assigning a “more correct”
clinical decision score on the post-test.

The findings from the turkey articles require more explanation.
As expected, students identified more errors in the turkey papers.
However, at most, only 35% of the deliberately inserted methodologic
flaws were correctly identified. Despite being unable to
accurately identify all of these errors, the students tended not to alter
their original management decisions when they had been assigned 
turkey papers. It seems that the students were uncomfortable with 
the authors’ conclusions and, without necessarily being able to 
specify the flaws, decided to either maintain their original manage- 
ment decisions or make small changes in either direction. While 
the authors had anticipated from a curriculum review that the 
“flaws” inserted into the articles might be identified by students, 
one weakness of this study is that there was no assessment of the 
tutors’ abilities to identify them. 

Finally, there was no relationship between the number of flaws 
identified and the “correctness” of the scores the students assigned 
on the pre-test. This implies that the students were able to read 
the articles critically without being biased by their perceptions of 
the correct management decisions, thereby providing further evi- 
dence that our students treated the articles in a rational manner. 

In summary, the current findings show that our first-year students 
do indeed have relatively limited ability to identify specific meth- 
odologic issues in journal articles. Despite this, however, the clin- 
ical decision-making results demonstrated a gratifying relationship 
between the students’ perceptions of the “quality of evidence” and 
appropriate changes in their management decisions. This suggests 
that students are reading the literature more critically than might 
be assumed by simply testing their knowledge of particular critical 
appraisal concepts. That is, while seeming to treat articles appro- 
priately, students may not be able to articulate specific methodo- 
logic errors, thereby giving the appearance of poor critical appraisal 
skills. While it is important for students to be able to articulate 
critical appraisal concepts, the current results suggest that
examining students’ abilities in this domain should take place in the
context of clinical decision making. Our participants’ capacity to
alter their decisions in a rational manner suggests that even novice
medical students should be strongly encouraged to critically
appraise. Future research will determine to what extent the correct
or incorrect perceptions by students of particular methodologic
flaws influence their clinical decision making.

The authors thank Glenn Jones for generating the checklist that was used by the class 
of 2000 and Annette Schrapp for administrative support in preparing and distributing
the materials to study participants and tutors. 

Correspondence: Kevin Eva, Department of Psychology, McMaster University Faculty
of Medicine, Hamilton, Ontario L8S 4K1, Canada. 


• AN OBJECTIVE LOOK AT OSCEs



Moderator: Sheila Chauvin, PhD



Communication Skills in Medical School: Exposure, Confidence, and Performance 

DAVID M. KAUFMAN, TONI A. LAIDLAW, and HEATHER MACLEOD



Numerous studies indicate that, although communication skills can
be learned, they can also deteriorate as students progress through
medical school, particularly in the clinical years as students learn
medical problem solving.*"^ The good news is that this deterioration
in communication skills can be prevented or reduced with
more rigorous training. This was the surprise finding of Davis and
Nicholaou,^ who compared the communication skills of first- and
fourth-year medical students. They found that fourth-year students
had superior facility in communication skills, which is attributable
to a greater emphasis on the importance of communication and
increased training in the curriculum. To be effective, communication
training must provide bridges between theory, knowledge,
practice, and exposure, with exposure providing students contact
with patients through clinical observation and clinical consultation.
Students acquire the most effective interviewing skills when
they interact with patients during their clinical training,' so
exposure to a wide variety of clinical situations is essential. Prior
training for such clinical encounters helps students develop working
knowledge, understanding, and communication skills for dealing
with challenging doctor-patient interactions.® Students must
fulfill three conditions to demonstrate appropriate communication
skills.' First, they need to know and understand a minimum of the
corpus of knowledge and theory underlying communication exchanges
in general and consultation processes. Second, they need
to have a positive attitude towards using these skills in their
interactions with patients. According to Bandura, ^ this attitude is best
developed through positive role modeling. Third, students need to
be trained in a repertoire of specific communication skills and
techniques and be placed in situations where these can be practiced
successfully with patients.’^

The purpose of this study was to examine students’ exposures to 
and confidence in communication skills, the relationship between 
exposure and confidence, and the relationship between exposure 
and performance of patient-doctor communication skills among 
students graduating from an undergraduate medical program. By 
exposure we mean observing, assisting, or performing the skill. The 
four categories of communication skills we studied were interviewing,
breaking bad news, crisis management, and counseling. We
refer to the last three of these as “higher-order” skills, as they in- 
volve progressively more challenging and complex communication 
interactions. 

Background 

Preclerkship Curriculum (Years One and Two). A problem-based
learning (PBL) curriculum was begun at Dalhousie University Faculty
of Medicine in 1992. The primary vehicle used to instruct
students in communication skills is a module on interviewing skills
in the first-year Patient-Doctor unit. Students are videotaped
interviewing a standardized patient, and they practice their skills in
small groups. They also receive lectures and written material. The
students observe and practice basic history taking in clinical
settings in their first and second years.

Clerkship Curriculum (Years Three and Four). At the time of the
study, the clerkship comprised an 86-week continuum of experience,
with significant flexibility and student choice. The students
received some formal training in communication skills during their
family medicine and psychiatry rotations. However, in other
rotations, instruction occurs in clinical settings on an ad hoc basis,
without a formal curriculum, as needs are identified.

Method 

The students in the sample comprised the first two classes (n = 
172) to graduate from the new PBL curriculum at Dalhousie 
(classes of 1996 and 1997). 

A locally-developed questionnaire was used to obtain students’ 
self-assessments of exposure and confidence. It consisted of four 
sections that asked students to indicate their levels of exposure to 
a set of ten communication skills (see Table 1). They also were 
asked to rate their confidence with respect to each skill, using a
6-cm visual analog scale with the ends marked “low” and “high.”
This is a useful and rarely used approach to assessing students’
confidence in their skills. The rating scale for exposure consisted of
the categories: never encountered, observed only, assisted senior
staff member, performed once, performed two or more times.
Students in the classes of 1996 and 1997 also participated in a
two-hour objective structured clinical examination (OSCE) with
simulated patients. However, three ten-minute communication
stations were added to the 1997 OSCE, dealing with (1) requesting
an organ donation from the husband of a woman declared “brain
dead,” (2) counseling a middle-aged woman with depression, and
(3) managing a 70-year-old woman brought to the emergency
department by her daughter after a fall. All students were rated in
each station by a trained physician-examiner, using a standard
rating scale.

The students took the two-hour OSCE on the day following
completion of their final clerkship rotations at the end of medical 
school. While awaiting their results in a large room, they completed 
a series of questionnaires, including the one used in this study. 
Students’ identities were masked before coding to ensure confiden- 
tiality. 

The data were analyzed using a statistical software package, and
means, standard deviations, and frequencies were calculated. The 
five exposure categories were recombined into three: never en- 
countered, observed or assisted, and performed one or more times. 
This was done retrospectively so that the number of students in 
each category would be high enough for statistical comparison. Two 
one-way ANOVAs were run across these three categories of ex- 
posure, one to compare the students based on their confidence lev- 
els and the other to compare them based on their OSCE perfor- 
mances. 
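
The analysis steps above can be sketched as follows. This is a minimal illustration rather than the authors' procedure: it collapses the five exposure categories into three, converts a visual analog mark to a percentage of the scale length (6 cm, as stated in the Method), and computes a one-way ANOVA F statistic by hand. The names `COLLAPSE`, `vas_to_percent`, and `one_way_anova` are hypothetical, and the responses are invented for the example.

```python
from statistics import mean

# Five questionnaire categories mapped onto the three used for analysis.
COLLAPSE = {
    "never encountered": "never encountered",
    "observed only": "observed or assisted",
    "assisted senior staff member": "observed or assisted",
    "performed once": "performed one or more times",
    "performed two or more times": "performed one or more times",
}


def vas_to_percent(distance_cm, scale_length_cm=6.0):
    """Convert a mark on the visual analog scale to percent of its length."""
    return 100.0 * distance_cm / scale_length_cm


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of scores."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Invented responses (exposure category, VAS mark in cm); not the study data.
responses = [
    ("never encountered", 1.2), ("never encountered", 1.8),
    ("observed only", 2.4), ("assisted senior staff member", 3.0),
    ("performed once", 4.2), ("performed two or more times", 4.8),
]
by_level = {}
for category, mark_cm in responses:
    by_level.setdefault(COLLAPSE[category], []).append(vas_to_percent(mark_cm))

f_stat, df_between, df_within = one_way_anova(list(by_level.values()))
```

The same `one_way_anova` call would apply to OSCE station scores grouped by collapsed exposure level, which corresponds to the second of the two analyses described.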

Results 

The response rate for this study was 88% (148/172). Table 1 presents
the results for level of exposure and confidence.

Table 1. Levels of Exposure and Confidence in Communication Skills for the Dalhousie Medical School Graduating Classes, 1996 and 1997 (n = 148)

                                                                          Level of Exposure*
                                                              Never         Observed or   Performed One     Confidence†
                                                              Encountered   Assisted      or More Times     Mean (SD)
Communication Skill                                           (%)           (%)           (%)               (%)
Interviewing
  General adult history                                         0.7           0             99.3            84.4 (12.2)
  General pediatric history                                     1.3           1.3           97.3            76.8 (16.5)
  Eliciting sexual history                                      0.7           2.7           96.7            71.1 (19.5)
  Eliciting history of drug or alcohol abuse                    2.6           3.3           94.0            76.0 (17.4)
  Eliciting history of sexual or physical abuse                21.2          19.2           59.6            53.4 (26.9)
Breaking bad news
  Breaking bad news (patient or relative)                       6.0          43.3           50.7            51.7 (25.5)
Crisis management
  Managing a patient exhibiting drug-seeking behavior          16.2          45.9           37.8            50.7 (25.5)
  Managing a violent or hostile patient                        13.5          48.6           37.8            46.7 (25.8)
Counseling
  Providing counseling for drug or alcohol abuse               29.1          41.7           29.1            43.0 (26.8)
  Providing counseling for victim of physical or sexual abuse  55.6          33.8           10.6            30.5 (24.5)

*Scale categories were collapsed to create this table as follows: “observed only” and “assisted senior staff member” were collapsed to “observed or assisted.” “Performed once”
and “performed two or more times” were collapsed to “performed one or more times.”
†Distance marked along the 6-cm visual analog scale was converted to percentage of total length of scale.

Nearly all students in both classes had taken a general adult
history (99.3%) and a general pediatric history (97.3%). In fact,
closer examination of the data showed that most students had
performed these skills two or more times. The majority of the classes
also had elicited, one or more times, a sexual history (96.7%), a
history of drug or alcohol abuse (94.0%), and a history of sexual
or physical abuse (59.6%). With respect to the higher-order
communication skills, smaller proportions of the classes had performed
these at all: breaking bad news to a patient or relative (50.7%),
managing a patient seeking drugs (37.8%), managing a violent or
hostile patient (37.8%), counseling for drug or alcohol abuse
(29.1%), and counseling for victims of physical or sexual abuse
(10.6%).

The students in the graduating classes of 1996 and 1997 rated
their confidence in interviewing relatively high for general adult
history (84.4%), general pediatric history (76.8%), eliciting sexual
history (71.1%), eliciting history of drug or alcohol abuse (76.0%),
and eliciting history of sexual or physical abuse (53.4%). In the
areas of breaking bad news and crisis management, ratings were
around or below 50% (see Table 1). Lower confidence ratings were
given to the counseling areas (i.e., drug and alcohol abuse [43.0%]
and physical or sexual abuse [30.5%]).



Since the complexity of the higher-order skills may have contributed
to lower confidence, seven individual skills were examined
(see Table 2). The students in the graduating classes of 1996 and 
1997 were more confident as their levels of exposure increased for 
each communication skill. 

Confidence levels were higher for each of the seven skills examined
for the group that had observed or assisted than they were
for the group that had never encountered the skill. More dramatic
differences were observed between the group that had performed
the skill one or more times and the group that had simply
observed or assisted.
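
The F-ratios reported in Table 2 can be roughly checked from the published group means and standard deviations. The sketch below is only an illustration under stated assumptions: the group sizes (24, 68, and 56) are not reported directly and are inferred here from the Table 1 percentages applied to n = 148, the standard deviations are assumed to be sample standard deviations, and the function name is hypothetical.

```python
def anova_f_from_summary(groups):
    """One-way ANOVA F-ratio from per-group (n, mean, sd) summaries."""
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * mean for n, mean, _ in groups) / total_n
    # Between-groups sum of squares: spread of group means around the grand mean.
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)
    # Within-groups sum of squares, rebuilt from each group's sample SD.
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# "Managing a patient exhibiting drug-seeking behavior" (Table 2).
# Group sizes are ASSUMED from Table 1: 16.2%, 45.9%, and 37.8% of 148.
drug_seeking = [
    (24, 14.6, 14.9),  # never encountered
    (68, 46.5, 18.9),  # observed or assisted
    (56, 71.5, 17.7),  # performed one or more times
]
f_ratio = anova_f_from_summary(drug_seeking)
print(round(f_ratio, 1))
```

Under these assumptions the result is close to, though not identical with, the reported F-ratio of 89.6; the small gap is what one would expect from rounded inputs.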

Table 2. Self-ratings of Confidence in Communication Skills by Levels of Exposure for the Dalhousie Medical School Graduating Classes,
1996 and 1997 (n = 148)*

                                                                Level of Exposure, Mean (SD)
                                                                Never        Observed or  Performed One
Communication Skill                                             Encountered  Assisted     or More Times  F-ratio†
                                                                (%)          (%)          (%)

Interviewing skills
  Eliciting history of drug or alcohol abuse                    31.6 (11.3)  55.9 (14.1)  78.3 (15.2)    23.4
  Eliciting history of sexual or physical abuse                 28.4 (23.0)  38.5 (17.0)  67.3 (21.5)    48.7
Breaking bad news                                               23.4 (11.9)  36.9 (22.6)  67.9 (18.1)    50.3
Crisis management
  Managing a patient exhibiting drug-seeking behavior           14.6 (14.9)  46.5 (18.9)  71.5 (17.7)    89.6
  Managing a violent or hostile patient                         22.4 (22.2)  40.8 (20.8)  66.4 (19.0)    46.4
Counseling skills
  Providing counseling for drug and alcohol abuse               21.3 (19.8)  38.1 (18.2)  69.9 (20.8)    68.8
  Providing counseling for victims of physical or sexual abuse  18.0 (17.1)  39.1 (20.8)  66.1 (22.4)    50.6

* Confidence ratings have been converted to percentage scores (0-100%). The first three interviewing skills are not included since nearly all students fell into the third group
(performed one or more times).
† All F-ratios are statistically significant, p < .001.

An ANOVA on the total score across the three OSCE communication
stations (1997 class) was conducted for each of the
following exposure groups: low exposure across all ten communication
skills, medium exposure, and high exposure. The scores
across the three stations were combined in order to achieve a more
adequate representation of the students’ performances. Since skills
are context-dependent, this combined score yielded a more valid
and reliable outcome measure. In order to provide a more defensible
measure for the variable, exposure was defined as total exposure
across all ten skills. We felt that a total exposure score would better
represent students’ actual medical school experiences in doctor-patient
communication. The results showed that OSCE performances
increased from the low-exposure group (n = 9; mean =
59.8) to the medium-exposure group (n = 80; mean = 64.3) to the
high-exposure group (n = 58; mean = 66.3). These differences were
statistically significant (F = 3.1; p = .05).

Discussion 

In this study, graduating medical students had higher levels of ex- 
posure to standard communication skills than to higher-order com- 
munication skills, and their confidence levels were lower for the 
higher-order communication skills. One possible alternative
explanation for the lower levels of confidence with respect to
higher-order communication skills is that these skills are more demanding.
Therefore, we decided to compare the confidence levels of students
for each individual communication skill, as a function of type of
exposure. The students who had performed each skill had much
higher confidence levels than did those who had only observed or
assisted. Also, the students who had observed or assisted with the
skill had much higher confidence levels than did the students who
had not encountered the skill at all. However, our findings suggest
that observing or assisting is insufficient to develop confidence to
an educationally significant degree; the more substantial gains were
observed when students had performed the skill one or more times.

Although increased exposure increases confidence, a crucial
question is whether increased exposure also leads to improved
performance in applying these skills. The results of this study showed
that this is indeed the case. The students who had had more overall
exposure to the ten communication skills in this study performed
at higher levels on the three OSCE stations emphasizing communication
skills. Although not all ten communication skills were
assessed in the three OSCE stations, these skills are composed of
many common subskills, such as developing rapport, listening
actively, explaining, and planning. Students with more exposure
overall to the ten skills would have developed these subskills to a
greater extent, and would most likely have better applied them in
the OSCE.

It is important to note that students with less confidence in their 
abilities to exercise a skill may have avoided performing the skill 
in the clinical setting. Therefore, a causal relationship between
exposure and confidence in this study should not be assumed.
Although the results of the study confirmed our hypotheses, the
exposure scale used did not measure actual level of exposure, i.e.,
number of times observed, assisted, or performed. Because the
exposure scale simply measured students’ recall of exposures, some bias
may have been introduced. More important, the study surveyed
only two medical school classes, and only one class’s performance,
so a broader survey is needed to confirm our findings.

The results of this study indicate that undergraduate students
may not be getting sufficient opportunities to observe and practice
complex communication skills in clinical or classroom settings,
which results in low confidence levels. Factors affecting students’
confidence do relate to clinical exposure, but they are also influenced
by students’ training in communication skills through structured
programs that provide opportunities for learning and practice.
The focus of this training appears to be on basic interviewing and
interpersonal skills and not necessarily on higher-order skills. This
was the case with the graduating classes in this study. All students
had been given instruction in basic interviewing techniques. They
had had opportunities to learn these techniques by observing videos
and through role playing, by practicing their skills on each other
and with simulated patients, and by receiving feedback on their
skills from other students, course instructors, and simulated
patients. For the higher-order skills, the students had been given
exposure to breaking bad news through video programs and discussion
as part of their training in palliative care; however, they had not
had the opportunity to practice and receive feedback in these skills,
as had been the case in their interviewing skills program. The
students had been given no classroom training in crisis management
or active counseling skills.

Both types of exposure may have to become more orchestrated 
for students during their undergraduate training. Providing effective 
training in higher-order communication skills as a core part of the 
undergraduate curriculum, where students have ongoing opportu- 
nities to observe, practice, and receive feedback in these skills, is 
a significant first step. This could occur in the clerkship years using 
the same techniques employed in learning basic interviewing skills. 
For this training, however, the use of role playing and standardized
patients becomes particularly important. Once students have
practiced these skills, they need to be provided with the opportunity
to use them under supervision in a clinical setting. This practice
will require some faculty development to ensure that physicians
have the necessary skills to supervise effectively.

Correspondence: Dr. David M. Kaufman, Division of Medical Education, Clinical
Research Centre, Room C-115, Dalhousie University, Halifax, Nova Scotia, Canada B3H
4H7.

• AN OBJECTIVE LOOK AT OSCEs 



Moderator: Sheila Chauvin, PhD 



Assessment of Residents' Interpersonal Skills by Faculty Proctors and Standardized Patients: 

A Psychometric Analysis 

MICHAEL B. DONNELLY, DAVID SLOAN, MARGARET PLYMALE, and RICHARD SCHWARTZ 



The objective structured clinical examination (OSCE) has typically
been found to be a valid and reliable method for assessing clinical
knowledge and skills when evaluating performances of residents.
For example, Sloan et al. found a 19-problem, 38-station OSCE
to be reliable (r = .91) and valid in assessing the clinical skills of
56 surgical residents.

Often, OSCE performance is summarized in an overall score,
which may represent a combination of history, physical examination,
interpersonal and communication skills, technical skills, and
organization. Interpersonal skills scores are sometimes reported
separately because of their importance in overall performance. Warf et
al. found that when faculty judges evaluated general surgery
residents’ performances on a neurosurgical station there was no
statistically significant difference between the junior and senior residents
in performing the neurologic examination. Since general surgery
residents do not receive training in neurosurgery beyond their
intern year, it was not unexpected that there was no significant
difference between levels of training. However, the senior residents
were judged to be competent significantly more frequently than
were the junior residents. It was also found that interpersonal skills
correlated significantly with both competence and level of training.
This study suggested that interpersonal skills are a very important
facet of clinical competence that differentiates between residents
at different skill levels.

Colliver et al. also found statistically significant correlations (in
the .30 to .50 range) between interpersonal skills and clinical
competence. Similarly, Sloan et al. found that global interpersonal skill
judgments were moderately reliable and correlated highly with
overall OSCE performance scores. Thus, it is clear that interpersonal
skills are highly associated with the judged competency of
medical students’ or residents’ performances.

Several studies have raised the question of who should evaluate 
interpersonal skills, a faculty proctor (FP) or the standardized pa- 
tient (SP). Given the increasing clinical demands on faculty time, 
it is important to know whether SPs can assess interpersonal skills 
as validly and reliably as faculty members. Cooper and Mira found 
that, on average, SPs gave more positive evaluations of commu- 
nication skills of undergraduate medical students than did faculty 
members or other professional staff. They found that the commu- 
nication scores derived from faculty’s ratings did not correlate with 
the scores derived from the SPs’ ratings. 

Finlay et al. assessed the communication skills of primary care
physicians who had just received training in communication skills.
Professional examiners and SPs evaluated the physicians’ communication
skills by means of a checklist. The two sets of scores
correlated between .40 and .50 on the different OSCE problems,
indicating that the SPs’ evaluations cannot be used interchangeably
with the faculty’s evaluations.

In a test of the validity of eight faculty raters, Kalet et al.
videotaped the performances of 21 year-two medical students. Faculty
evaluated the interviewing skills of those students on two different
occasions using a checklist. The correlations of the communication
scores among faculty members were low. Furthermore, the
correlations between a faculty member’s evaluations of the interviewing
skills of the same students’ performances on two occasions were
also low.

A related question is whether checklist scores or global ratings
provide more reliable and valid measures of performance. Regehr
et al. compared the psychometric properties of checklists with
those of global rating scales on an eight-problem OSCE given to
residents at all levels of training. They found better reliability and
construct validity for global rating scales than for checklists. On
the other hand, Hodges et al. also evaluated the comparative
reliability and validity of checklists and global ratings of communication
skills. They found high correlations between global ratings
and checklists.

Based on this review of the literature, we conclude that
interpersonal skills are an important component of clinical competence.
Global ratings are at least as valid and reliable as checklist scores.
However, the levels of reliability and validity of interpersonal-skills
ratings have not been clearly established. Also, it is not clear
whether faculty or SPs provide the more valid and reliable
evaluations. The purpose of this study was to determine the
psychometric characteristics of global interpersonal skills ratings of faculty
proctors (FPs) and SPs.

Method 

All 56 residents of a general surgery program participated in a
12-problem, 24-station surgery OSCE. Each OSCE problem was
divided into two stations: Part A, in which a history and/or physical
examination was performed or information was given to the SP,
and Part B, in which the resident responded to several short-answer
questions about the patient or SP seen in Part A. This study focused
on the 12 Part A stations during which the FPs and SPs evaluated
the residents’ interpersonal skills.

Each of the 24 OSCE stations was five minutes in duration. At
each station were either actual patients or SPs who had been
trained to act in a consistent manner. As part of their training,
they had been instructed in evaluating the residents’ interpersonal
skills. They had practiced making these evaluations during their
training, formally evaluating the interpersonal skills of a resident,
who was also evaluated by the trainer. Their evaluations were
compared and the trainer and the SP discussed any differences in their
evaluations. The typical training session lasted about one hour.

During each resident-patient encounter, an FP checked off in- 
dicated behaviors as they occurred. At the end of each of the Part 
A stations, the faculty member evaluated the resident’s interper- 
sonal skills (along with several other global performance dimen- 
sions). (Note that the trainer had reviewed the checklist and global 
ratings with each FP immediately before the OSCE.) Faculty rated 
their level of agreement with the statement “Interacted effectively 
with the patient” (0 = “not at all” to 4 = “very much”). The SPs 
independently evaluated the residents’ interpersonal skills by telling 
the preceptors their ratings on the same five-point scale. 

In order to determine the similarity of FPs’ and SPs’ ratings, the
following analyses were done. First, the reliability of each of the 12
sets of paired FP and SP ratings was estimated by coefficient α. It
was also estimated for the mean rating of the FP and the mean
rating of the SP. Second, the reliability of the faculty’s ratings of
the residents’ interpersonal skills across the 12 stations was
estimated by means of coefficient α, and it was also calculated
separately for the SPs’ ratings. The Spearman-Brown formula was then
used to estimate the expected reliabilities for two faculty raters and
two SP raters. These Spearman-Brown estimates provide a standard
against which to judge the magnitudes of paired FP and SP
reliabilities.

Table 1. Psychometric Properties of Faculty Proctors’ Ratings and Standardized Patients’ Ratings of General Surgery Residents’
Interpersonal Skills on a 12-Problem OSCE

                                                    Mean        P-value,    Faculty    Standardized Patient
Station                                Reliability  Difference  Mean        Construct  Construct
                                                                Difference  Validity   Validity

Rating of faculty and patients (mean)  .92          -0.16       <.001       .68*       .73*
Neurosurgery                           .94          -0.18       n.s.        .30*       .34*
Postoperative care                     .85          -0.53       <.001       .35*       .32*
Plastics                               .84          -0.14       n.s.        .55*       .59*
Breast options                         .81          -0.42       <.001       .57*       .43*
Head and neck                          .79          -0.16       n.s.        .41*       .56*
Breast examination                     .72          0.00        n.s.        .39*       .19
Thyroid                                .72          -0.30       <.003       .39*       .31*
Computed tomography                    .70          0.05        n.s.        .10        .27*
Leg ulcer                              .64          0.04        n.s.        .05        .31*
Abdomen history                        .64          -0.04       n.s.        .49*       .36*
Biliary colic                          .59          -0.52       <.001       .25        .44*
Hypercalcemia                          .28          -0.06       n.s.        .31*       .21

* p < .05 (construct validity).

It is possible to have relatively high reliabilities even though the 
FPs’ and SPs’ ratings may not be very closely calibrated. For ex- 
ample, an SP might be a more lenient evaluator than an FP. One 
indicator of similar calibration is that the mean rating of the FP is 
not significantly different from the corresponding mean rating of 
the SP. A two-way analysis of variance (faculty versus standardized
patient, a “between-groups” factor; and comparing clinical problems,
a “within-groups” factor) and analyses of simple effects were
used to determine whether the FPs and the SPs evaluated the
residents’ interpersonal skills at approximately the same performance
level.

If the paired FPs’ and SPs’ ratings are valid (convergent validity),
they ought to correlate more highly with each other than with any
other interpersonal skill rating. However, it is possible that this
might not be the case. For example, faculty evaluations could
correlate most highly with other faculty evaluations as SPs could with
other SPs. To determine how the different ratings relate to one
another, a hierarchical cluster analysis, using 1 - Pearson r as the
similarity metric and the complete linkage amalgamation rule, was
performed. Clustering methods represent a variety of procedures
that identify how variables group (cluster) together. Cluster analysis
joins variables together based on the magnitude of the
inter-correlations among the variables. A cluster is defined by two or more
variables that correlate more highly with each other than they do
with the other variables. We chose hierarchical cluster analysis over
factor analysis because hierarchical cluster analysis better represents
the relationships among variables when most of the variables
intercorrelate substantially with each other. This analysis indicated
whether the interpersonal-skills ratings clustered predominantly by
(1) clinical problem (FP and SP couplets) or (2) FPs separately and
SPs separately.
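
To make the clustering rule concrete, the following is a minimal sketch of agglomerative clustering with 1 - Pearson r as the distance and complete linkage as the amalgamation rule. It is not the software used in the study. The four rating variables (FP_A, SP_A, FP_B, SP_B) and their correlations are hypothetical values chosen for illustration, and the function name is an assumption.

```python
from itertools import combinations


def complete_linkage(labels, corr):
    """Cluster variables using distance 1 - r and complete linkage.

    labels: list of variable names.
    corr:   dict mapping frozenset({a, b}) to the Pearson r between a and b.
    Returns the merges in order as (cluster_1, cluster_2, distance).
    """
    clusters = [(name,) for name in labels]
    merges = []

    def dist(c1, c2):
        # Complete linkage: the distance between two clusters is the
        # largest pairwise distance between their members.
        return max(1.0 - corr[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        # Join the two clusters that are currently closest.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: dist(*pair))
        merges.append((c1, c2, round(dist(c1, c2), 2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# HYPOTHETICAL correlations among two FP and two SP ratings on stations A and B.
labels = ["FP_A", "SP_A", "FP_B", "SP_B"]
corr = {
    frozenset(("FP_A", "SP_A")): 0.89,  # same station, different rater type
    frozenset(("FP_B", "SP_B")): 0.60,  # same station, different rater type
    frozenset(("FP_A", "FP_B")): 0.20,  # same rater type, different station
    frozenset(("SP_A", "SP_B")): 0.17,  # same rater type, different station
    frozenset(("FP_A", "SP_B")): 0.15,
    frozenset(("SP_A", "FP_B")): 0.25,
}
merges = complete_linkage(labels, corr)
for first, second, distance in merges:
    print(first, second, distance)
```

With these made-up values the FP and SP ratings of the same station join first, which is the "clinical problem" pattern described in (1); if the same-rater correlations were the larger ones, the clusters would instead form by rater type, as in (2).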

Finally, if the interpersonal-skills ratings are valid (construct
validity), senior residents ought to perform better than junior
residents and interns. To this end, Pearson’s correlations were
calculated between interpersonal skills ratings and postgraduate year.
These analyses were carried out for the 12 OSCE stations and the
across-station averages for both FPs and SPs. Based on our
experience with validity studies such as this, we expected the validity
coefficients to be around .40 to .50. Fisher’s t-test for differences
between correlations was used to test whether the validity
coefficients for the FPs and the SPs were significantly different from one
another.

Results 

The first data column of Table 1 presents the reliability coefficients
for each of the paired (FP and SP) ratings for each of the 12
stations, and of the mean ratings of the FPs and SPs. The reliability
of the mean FPs’ and SPs’ ratings is high, .92. The magnitudes of
the reliabilities for the various stations vary: eight of these
reliabilities were above .70, two were in the .60s, one was in the .50s,
and one was .28.
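
As an illustration of how a paired reliability of this kind is computed, the sketch below applies coefficient α to one station's FP and SP ratings. The six pairs of ratings are hypothetical (the study's raw ratings are not reported), the 0-to-4 scale follows the Method section, and the function name is an assumption.

```python
from statistics import variance


def coefficient_alpha(ratings):
    """Coefficient alpha for a table of ratings.

    ratings: list of rows, one row per resident, each row holding the
             k ratings (here k = 2: one from the FP, one from the SP).
    """
    k = len(ratings[0])
    # Variance of each rater's column of ratings.
    item_variances = [variance([row[j] for row in ratings]) for j in range(k)]
    # Variance of the summed ratings across raters.
    total_variance = variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# HYPOTHETICAL (FP rating, SP rating) pairs for six residents at one station.
station_ratings = [(3, 4), (2, 2), (4, 4), (1, 2), (3, 3), (2, 3)]
alpha = coefficient_alpha(station_ratings)
print(round(alpha, 2))
```

In this made-up example the SP is slightly more lenient than the FP, yet α is still high, because α reflects how consistently the two raters order the residents rather than whether their mean ratings agree.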

The reliability of the faculty’s interpersonal-skills ratings for the 
12 OSCE stations was .77. The reliability was .74 for the SPs’ 
ratings. The Spearman-Brown formula was used to estimate what
these reliabilities would be if there were only two raters, to make
them comparable to the paired (FP and SP) reliabilities. The
estimated reliability of two faculty raters was .36, and it was .33 for
two SPs.
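
The step-down from 12 ratings to two can be sketched with the general Spearman-Brown formula, r_new = k r / (1 + (k - 1) r), where k is the ratio of the new number of ratings to the original number (here 2/12). The function and variable names below are assumptions made for illustration.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when the number of ratings is multiplied
    by length_factor (values below 1 shorten the measure)."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Step down from 12 stations to 2 ratings.
faculty_two_raters = spearman_brown(0.77, 2 / 12)
sp_two_raters = spearman_brown(0.74, 2 / 12)
print(round(faculty_two_raters, 2), round(sp_two_raters, 2))
```

Starting from the rounded values .77 and .74, this gives about .36 for the faculty and about .32 for the SPs; the latter is within rounding distance of the reported .33, which was presumably computed from an unrounded reliability.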

If the ratings of FPs and SPs provide fairly equivalent information
about the residents’ interpersonal skills, there should not be
significant differences in their mean ratings. The two-way analysis
of variance comparing the equality of faculty’s and SPs’ means and
the equality of means across problems indicated that (1) there was
not a significant difference between the two groups (p > .05), (2)
there were statistically significant differences among the means for
the various OSCE problems (p < .001), and (3) there were
significant interactions between groups and problems.

Since the significant interactions made the interpretation of the
main effects equivocal, analyses of simple effects were performed to
identify the exact pattern of differences. The second and third data
columns of Table 1 summarize these analyses. As can be seen from
this table, the differences in the mean FPs’ ratings and the mean
SPs’ ratings (across the 12 stations) are statistically significant (p
< .001). The mean difference is -0.16 on a five-point scale,
indicating that the FPs tended to evaluate the residents’ interpersonal
skills at a slightly lower level than did the SPs. There were
significant differences between paired ratings (FP and SP) for four of the
OSCEs. In those four cases, the faculty evaluated the residents
between three and five tenths of a scale point lower than did the
SPs. In the other eight cases, the mean differences were small and
not statistically significant.

If the interpersonal-skills ratings of a given FP and SP on a
particular OSCE are valid, they should correlate highest with each
other; however, they should also correlate significantly with the
other measures if interpersonal skills is generally a valid construct.
A hierarchical cluster analysis was performed to determine whether
the FP and SP rating pairs for each station clustered closest. This
dyadic clustering did take place for nine of the 12 possible OSCE
stations. In these nine cases, the pairs correlated highest with each
other. On two of the OSCE problems, the dyadic pairs did not
correlate most highly with each other for unknown reasons, and
on one OSCE problem, the pair did not cluster with each other
because of the lack of variability in the SP’s ratings.

To explore further the similarities in the ratings, the
intercorrelations were calculated among the 24 different ratings of the
residents’ interpersonal skills. The median correlation among all 276
pairs of ratings was .20 (range -.24 to .89). The median correlation
among the faculty’s ratings was also .20 (range -.20 to .56), while
the median was .17 (range -.16 to .50) for the SPs. In the case of
the 12 paired correlations, the median correlation was .60 (range
.20 to .89).

The construct validity of the FPs’ and SPs’ ratings was determined
using the construct of experience; residents with greater
experience should interact more effectively with patients than should
junior residents. Pearson correlations were calculated between in- 
terpersonal skills ratings and level of experience. The fourth and 
fifth data columns of Table 1 present these correlations for the 
faculty and the SPs, respectively. None of the OSCE’s paired cor- 
relations were significantly different from one another (Fisher’s t- 
test for paired correlations). The average FP’s rating (across the 12 
stations) and the average SP’s rating had higher construct validities 
than any of the individual interpersonal ratings. Construct validi- 
ties of .68 and .73 are very high. In the experience of the authors, 
construct validities usually do not exceed .50. Nine of the 12 con- 
struct validities for the faculty ratings were statistically significant, 
while ten were significant for the SPs. 
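
The significance flags in Table 1 can be checked approximately with the usual t-test for a Pearson correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below makes several assumptions that the paper does not state: that all 56 residents were rated at every station (n = 56), that the tests were two-tailed at the .05 level, and that the critical t for 54 degrees of freedom is about 2.005. The function name is hypothetical.

```python
from math import sqrt

# ASSUMED: two-tailed .05 critical value of t for df = 54 (approximate).
T_CRITICAL_DF54 = 2.005


def correlation_t(r, n):
    """t statistic for testing whether a Pearson correlation differs from zero."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


n_residents = 56  # ASSUMED to apply to every station
t_r27 = correlation_t(0.27, n_residents)  # flagged as significant in Table 1
t_r25 = correlation_t(0.25, n_residents)  # not flagged in Table 1
for r, t in ((0.27, t_r27), (0.25, t_r25)):
    print(r, round(t, 2), abs(t) > T_CRITICAL_DF54)
```

Under these assumptions a correlation of .27 just clears the threshold and .25 does not, which matches the pattern of asterisks in the table.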

Discussion 

In this study, faculty proctors and standardized patients were asked 
to evaluate residents’ interpersonal skills at the end of each OSCE 
station. They made their judgments using a simple single-item 
scale. For the most part, the level of agreement (reliability) between 
the FPs and the SPs was adequate. On four of the stations, the 
reliabilities were sufficiently low to minimize their usefulness in
making decisions about performance competency. On the other
hand, the reliabilities of the average rating of the FPs and the
average rating of the SPs were very satisfactory. Thus, it appears
that these simple judgments are for the most part “reliable.” On
the other hand, the variability in the magnitudes of the reliability
coefficients across stations suggests that one probably should not
plan to make educational decisions about competency from per- 
formances at individual stations. Rather, it appears that one should 
use average performance measures. 

An important consideration in estimating the reliability of
ratings of interpersonal skills is whether to estimate reliability across
problems or within problem pairs. The reliability of within-OSCE
ratings is higher than that of between-OSCE ratings. It may be that
interpersonal skills, like clinical reasoning skills, are affected by the
context of the clinical case.

To explore this possibility, the 24 different ratings of the resi- 
dents’ interpersonal skills were intercorrelated. The median corre- 
lation among all possible combinations of raters, among the FPs 
and among the SPs was about .20. On the other hand, the median 
correlation among the paired ratings (FP and SP) was .60. Further,
the hierarchical cluster analysis indicated that the
interpersonal-skills ratings primarily clustered by OSCE station and not by rater
type (FP or SP). This result has several implications. First, when
the SPs and FPs are evaluating the same patient, their ratings tend
to be more valid and more reliable than when the ratings are made
on different OSCEs. The reliability appears to be more a function
of the OSCE’s case than the OSCE’s evaluator. Standardized
patients tend to give slightly higher evaluations than do faculty
proctors. Our data do not indicate whether the FP or the SP is to be
preferred.

In summary, global ratings of interpersonal skills are both reliable 
and valid. Faculty proctors and standardized patients appear to be 
interchangeable as evaluators of interpersonal skills. Case content 
is an important factor that influences residents’ performances of 
interpersonal skills. 

Correspondence: Michael B. Donnelly, PhD, Department of Surgery, C-245, University
of Kentucky COM, 800 Rose Street, Lexington, KY 40536-0298.

The Effects of Examiner Background, Station Organization, and Time of Exam on OSCE Scores 
Assessing Undergraduate Medical Students’ Physical Examination Skills 

CHRISTOPHER JAMES DOIG, PETER H. HARASYM, GORDON H. PICK, and JOHN S. BAUMBER 



Since 1975, objective structured clinical examinations (OSCEs)
have gained widespread acceptance as a method of making reliable
assessments of clinical performance. Standardized patients (SPs)
function as patient, teacher, and evaluator by using their bodies as
teaching and evaluation material. SPs can be asymptomatic, have
stable findings, or be trained to simulate physical findings. SPs can
be taught to portray a variety of standardized clinical presentations.
Their participation in teaching and evaluating the complex clinical
skills included in OSCEs has been well established.

Research has demonstrated that multiple SP stations within the
OSCE format may generate scores that vary greatly in reliability,
from 0.20 to 0.95. With large fluctuations in scores’ reliabilities,
research efforts have focused on the variables that can decrease or
enhance the reliability of measurement. For example, inter-rater
reliability was found not to be a deterrent to consistent
measurement, and correlations generally varied from 0.80 to 0.90 between
observers and raters when case-specific checklists were developed
and if the items reflected observable behaviors. Due to the
case-specificity phenomenon described by Elstein, many cases are
generally needed to assess clinical competency within a defined
problem (e.g., chest pain). In other words, quality of performance on
one case is a very poor predictor of performance on another.
However, if a single attribute is assessed, the number of cases required
to attain reliable scores can be decreased (e.g., ten focused cases
are required to assess the general skill of history taking, eight cases
for physical examination, and 25 cases for differential diagnosis).

Most OSCE stations employ a single case with a single SP and 
a single observer. However, because of the cost of OSCEs, efficiency
would favor a station organized with two cases portrayed by a single 
SP. There are no research findings to indicate whether this orga- 
nizational structure could adversely affect the reliability of mea- 
surement of an OSCE candidate’s performance. Furthermore, 
OSCEs often use examiners from varied clinical backgrounds (e.g., 
residents, specialists, or family physicians). Given the importance 
of the OSCE’s evaluation format and its predominant use for teach- 
ing and evaluating clinical skills, there is a need to determine 
whether the reliability of scores would be compromised by a rater’s
background, a station’s organization, and the time of examination 
administration. 

Method 

Course Overview. The University of Calgary medical undergraduate
program is three years in duration, with 11 instructional
months per year. The first two years consist of “systems”-based
courses using a problem-oriented curriculum that is taught in
didactic lectures and small-group sessions. There is also a longitudinal
medical skills course focusing on professional development and
interdisciplinary skills, including a supervised setting for students to
be instructed in physical examination. A “core document” given
to each student provides detailed objectives for each physical
examination maneuver. A standard physical examination textbook is
recommended, and each student is provided with a six-hour video
that shows local clinical experts demonstrating physical examination
maneuvers. The instruction format is by small group. Preceptors
are family physicians, specialists, or senior medicine residents.
All small groups use SPs as instructional models. Further instruction
in physical examination is carefully integrated into “clinical
correlation” sessions within the systems courses. These sessions are
organized so that they build in an iterative fashion on skills
learned in the sessions of the medical skills course. The instruction
is also by small group. However, all preceptors are specialists
within the area, and they provide patients as instructional models.
These sessions expose students to clinical findings relative to each
system and permit examination techniques to be observed and
corrected by a clinical specialist within the area of study. At the
end of the second year, the students take a certifying OSCE, the
successful completion of which is a requirement for promotion into
clinical clerkship (third year).

OSCE Station Development. The second-year OSCE consists of
ten physical examination stations randomly selected from a bank
of 44 stations. Each station tests one physical examination maneuver,
and all were developed by one author (CJD) using the approach
described. All maneuvers were selected from the core document’s
enabling objectives. Each maneuver was broken down into
individual steps as outlined in the course’s textbook. Each of these
steps was identified as an item on a computerized examination score
sheet. Criterion-based scoring was used, with each item scored as
0 (omitted or incorrect), 1 (partially correct), or 2 (correct). Face
and content validity of each checklist was established by review
using a core group of physicians: five course preceptors, five medical
educators with expertise in evaluation, five physicians with expertise
in clinical teaching, and five specialists. The final content of
each checklist and the minimum performance level (MPL) for each
station were determined by consensus. It has previously been
demonstrated that the validity of identifying the important items
included in an OSCE station is superior when performed by a group
of faculty compared with one individual. Each station had been
used in previous OSCEs, and the examination’s properties
established.

Examination Process. The medical skills examination included
OSCE stations on history taking, physical examination, medical
bioethics, and culture, health, and illness. The examination
totaled 3.5 hours, one hour of which was the physical examination
section. The examination was conducted in one morning and one
afternoon session. Each candidate completed ten physical examination
maneuvers. At each station, there was one examiner and
one standardized patient per pair of maneuvers. At each station,
there was a short history to provide clinical context for each physical
examination maneuver, and students were given five minutes
to demonstrate the first examination maneuver. The students then
had one minute to review a short history for the second maneuver,
and then five minutes to complete the second maneuver. At the
end of 11 minutes, the students were given one minute to rotate
to the next station (located in a separate examination room
immediately adjacent to the preceding station) and to review the
history for the first of the two maneuvers for the subsequent station.
The physical layout of each station was standardized, with the
patient dressed in appropriate examination apparel (but not draped
or positioned for the examination), an examining table, necessary
equipment on an adjacent table, and the examiner to one side.
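
The timing described above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the minute values come from the text, while the constant and function names are hypothetical, and it assumes one rotation minute is counted for every station, including the last.

```python
# Timing figures are taken from the text; all names are illustrative.
MANEUVER_MIN = 5         # minutes allowed for each maneuver
HISTORY_REVIEW_MIN = 1   # minutes to read the history for the second maneuver
ROTATION_MIN = 1         # minutes to move on and read the next history
MANEUVERS_PER_STATION = 2
TOTAL_MANEUVERS = 10


def station_minutes() -> int:
    """Minutes spent inside one paired station (two maneuvers plus one review)."""
    return MANEUVERS_PER_STATION * MANEUVER_MIN + HISTORY_REVIEW_MIN


def section_minutes() -> int:
    """Minutes for the whole section, assuming one rotation minute per station."""
    stations = TOTAL_MANEUVERS // MANEUVERS_PER_STATION
    return stations * (station_minutes() + ROTATION_MIN)


print(station_minutes())  # 11
print(section_minutes())  # 60
```

Under these assumptions the five paired stations account for the one hour that the text allots to the physical examination section.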

Physical examination stations were grouped into two streams:
Stream A paired maneuvers in one station that were from the same
system or anatomic region, or that required a similar physical exam
skill; Stream B paired physical exam maneuvers that were not similar
in region or skill examined. The pairings and sequence of
examination maneuvers in Stream A were spleen and ascites,
mini-mental status exam and median nerve, jugular venous pulse (JVP)
and peripheral arterial system, shoulder and cervical spine, and
visual fields and lung surface anatomy (the final pairing representing
an understanding of clinical correlative anatomy). The pairings and
sequence of examination maneuvers in Stream B were cervical
spine and JVP, ascites and peripheral arterial system, lung surface
anatomy and median nerve, mini-mental status and visual fields,
and shoulder and spleen exams. Each stream ran in parallel during
the morning and afternoon sessions. The pairings and examination
maneuver sequences within the two streams remained constant
between the morning and afternoon sessions.

TABLE 1. Summary of Individual Station and Overall Examination Results on a Ten-station Physical Examination Skills OSCE

                                               Student Examination Results
Examination Maneuver*
(No. Items per Checklist)             MPL†     Mean (SD)        Range          Proportion Successful

Ascites (10)                          64.15    81.09 (16.59)    14.26-100      61/69
Cervical spine (13)                            88.85 (11.10)    39.94-100      66/69
Jugular venous pulse (20)             67.68    74.64 (11.87)    35.45-100      55/69
Lung surface anatomy (14)                      90.69 (11.87)    14.27-100      59/69
Median nerve (16)                     53.78    61.08 (18.25)    13.84-100
Mini-mental status (18)               73.26    76.69 (9.07)     53.28-93.24    54/69
Peripheral arterial vasculature (17)           68.62 (16.96)    23.04-96.01    47/69
Shoulder (17)                         61.47    72.60 (15.34)    38.42-100      54/69
Spleen (16)                                    75.53 (12.91)    23.78-95.10    63/69
Visual fields (14)                    79.88    78.29 (16.22)    14.98-100      48/69
Overall (155)                         66.72    75.93 (7.12)     56.95-91.37    65/69

* These are the ten physical examination maneuvers used during the examination. The maneuvers were paired into two streams of five stations. One stream paired maneuvers
from similar body regions or physiologic systems (e.g., spleen and ascites), and one stream paired maneuvers from non-similar systems (e.g., shoulder and spleen).
† Minimum performance level or pass level.

Each examiner was a physical examination course preceptor. Two 
weeks prior to the exam, the examiners were sent the following 
station-specific information: a photocopy of the maneuver-specific 
objectives, a photocopy of the textbook describing the examina- 
tion, and the station checklist. Each examiner was asked to review 
the appropriate section of the videotape (the videotape had been 
previously provided). An instructional session was held with all 
examiners to review the stations’ expectations, checklists, and
performance, and to discuss concerns. The examiners were not aware
of the method of station validation, or the stations’ minimum
performance levels (MPLs). Six examiners were internal medicine
residents, eight were family practitioners, and six were specialists.

An administrative assistant, unaware of the study’s hypotheses, 
randomly allocated both the examiners and students to Streams
A and B, and times of examination (a.m. or p.m.).

Statistical Analysis. We hypothesized that the type of examiner,
the stations’ pairings of maneuvers that required similar content
knowledge (extrapolated as being from the same examination
systems), and times of examination would not contribute significant
variance to the overall measure of examination reliability. For
analysis, we used the generalized estimating equation (GEE) method, a
modification of the generalized linear model (GLM). GEE modeling
is a robust and validated method of random-effects multivariate
modeling that estimates general linear models but also permits
a priori specification of a within-student correlation structure. In
summary, the model provides an analysis of variance, but permits
control of the potential effect of unequal distribution of data and
the necessity to account for repeated measures. We used the
exchangeable correlation structure within the GEE method to
estimate the effects of the individual covariates (and any interactions)
on the dependent variable of student performance. As the
sequence of examination maneuvers at each station was held
constant within each stream, this was not included in the final analysis
model, nor did we model within-examiner correlations. All analyses
were performed with a statistical software package.
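
As a reading aid, a GEE model with an exchangeable working correlation can be written out as below. This is a sketch under assumptions the source does not state: an identity link, the maneuver-level score as the unit of analysis, and a covariate vector holding indicators for examiner type, stream, and time of examination. The symbols $Y_{ij}$, $\mathbf{x}_{ij}$, $\boldsymbol{\beta}$, and $\rho$ are introduced here for illustration only.

```latex
% Sketch only: the identity link and all symbols are assumptions, not the authors' notation.
% Y_{ij}: score of student i on maneuver j, with j = 1, \dots, n_i
\begin{align}
  \mathrm{E}[Y_{ij}] &= \mathbf{x}_{ij}^{\top}\boldsymbol{\beta}, \\
  \mathrm{Corr}(Y_{ij}, Y_{ik}) &= \rho \quad \text{for all } j \neq k, \\
  \mathbf{R}_i(\rho) &= (1-\rho)\,\mathbf{I}_{n_i} + \rho\,\mathbf{1}_{n_i}\mathbf{1}_{n_i}^{\top}.
\end{align}
```

Under this structure every pair of scores from the same student shares a single correlation $\rho$, which is what "exchangeable" refers to; scores from different students are treated as uncorrelated.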

Results 

Sixty-nine of 70 eligible students completed the examination: 35 
were randomized to Stream A, and 34 to Stream B. The examination
was structured, based on the availability of standardized patients,
to have an unequal distribution between morning and afternoon
sessions. Of the 69 students, 40 were assigned to
the morning examination, and 29 to the afternoon examination. 
Six examiners were residents, eight examiners were family physi- 
cians, and six examiners were specialists. The examiners were 
equally distributed between both streams and between morning and 
afternoon sessions. 

The alpha coefficient for the examination was 0.84. The MPL
for the examination was 66.85%, based on an equal weighting of
the MPLs from the ten stations. Sixty-five of the 69 students were
rated satisfactory on the overall physical skills examination. The
mean performance was 76.81% ± 7.35 (mean ± SD). The range
was from a low score of 56.51% to a high score of 92.28%. The
performances at the individual stations are presented in Table 1.
The overall mean score for candidates observed by senior internal
medicine residents was 75.55%, that for candidates observed by
family physicians was 79.22% (p = 0.07 compared with residents
or specialists), and that for candidates observed by specialists was
75.28% (p = 0.38 compared with residents). No practical difference
was observed in the candidates’ performances by stream assignment:
Stream A 77.00% and Stream B 76.61%. No practical difference
was observed between the performances of candidates during the
morning sessions (77.51%) and candidates during the afternoon
sessions (76.00%). There was no within-stream between-examiner
effect, and no within-time of examination between-examiner effect
demonstrated. An unexplained difference was observed in
the interaction of stream assignment and time of examination:
morning session Stream A = 74.50%, Stream B = 80.52%, and
afternoon session Stream A = 80.59% and Stream B = 71.40%.
This observed interaction could not be explained by an effect of
examiners. Given that the SPs and the pairings and sequences of
examination maneuvers within the stations did not change, and in
the absence of an alternate plausible explanation, the observed
interaction was presumed to be due to a random effect of individual
candidate performances.
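
The source does not say which alpha coefficient was computed; assuming it is Cronbach's alpha over the ten station scores, the calculation can be sketched as follows. The function name and the small score matrix are hypothetical and are not data from the study.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a students-by-stations score matrix.

    scores: list of rows, one row per student, one column per station.
    """
    n_students = len(scores)
    k = len(scores[0])
    if n_students < 2 or k < 2:
        raise ValueError("need at least two students and two stations")
    # Sample variance of each station's scores across students.
    station_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Sample variance of each student's total score.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(station_vars) / total_var)


# Hypothetical percentage scores: 5 students x 3 stations (not study data).
example = [
    [81, 75, 90],
    [64, 70, 72],
    [92, 88, 95],
    [55, 61, 58],
    [73, 79, 80],
]
print(round(cronbach_alpha(example), 2))
```

Alpha approaches 1 when students who do well on one station tend to do well on the others, which is why it is commonly read as an index of internal consistency.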
Conclusions

Using a sound research design and robust analytic techniques, we
found no evidence in this study that the variables (station
organization, time of examination, and clinical background of examiner)
contributed significant variance to the overall reliability of an
OSCE assessing physical examination skills. With two parallel
streams, and therefore two SPs’ simulating the same physical
examination maneuver, we assessed and found no difference in the
between-SPs mean value (form-within-case difference, as previously
suggested by Battles) for each physical examination maneuver
(data not presented), which supports a conclusion that bias in
our results was not introduced by the two SPs’ demonstrating the
same maneuver. Our assessment of only physical examination
maneuvers is similar to the study of Kowlowitz and colleagues and that
of Li and colleagues, and supports the reliability of our
examination. The difference in examiners’ performances between family
physicians and internal medicine residents or specialists did
demonstrate a trend toward significance, and the absence of a
statistically different result may have reflected a type II error. The effect
of the examiner’s background on rating students’ performances
requires further study.

Though OSCE examinations have gained widespread acceptance,
major practical impediments remain in their cost and their
labor-intensive organization. Reznick estimated the total costs for
developing an OSCE and administering it to 120 students in a
single medical school to be from a high of $104,400 to a low of
$59,460, or $496 to $870 per student (Canadian denomination,
CND). For administering the exam only, costs ranged from
$19,200 to $34,500 (CND) if examiners and SPs were paid, or from
$16,500 to $19,200 (CND) if only SPs were paid (both estimates
include catering costs for both examiners and SPs). In previous
examinations using ten physical examination maneuvers, but without
pairing of maneuvers within one station, we required 40
examiners and 40 standardized patients. The large numbers of
examiners and SPs were a significant cost and administrative burden
for our examinations, and they were important factors in our
adopting the paired station strategy. Two previous examinations
without paired stations had an average alpha of 0.76. Our current
study’s findings support the premise that the pairing and
sequencing of stations will not reduce the reliability of the
assessment of a candidate’s performance.
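
The per-student costs and the staffing figures quoted above follow from simple arithmetic, sketched below. The dollar amounts, the 120 students, and the counts of 40 examiners and SPs are from the text. The function names are hypothetical, and the staffing rule (one examiner and one SP for every station in each of two streams and two sessions) is an assumption that happens to reproduce both the 40 examiners needed without pairing and the 20 examiners used in this study.

```python
def cost_per_student(total_cost: float, students: int) -> float:
    """Total cost divided evenly across students."""
    return total_cost / students


def staff_needed(maneuvers: int, maneuvers_per_station: int,
                 streams: int = 2, sessions: int = 2) -> int:
    """Examiners (or SPs) needed, assuming one per station per stream per session."""
    stations = maneuvers // maneuvers_per_station
    return stations * streams * sessions


print(cost_per_student(104_400, 120))  # 870.0
print(cost_per_student(59_460, 120))   # 495.5, reported as $496 in the text
print(staff_needed(10, 1))             # 40 without pairing
print(staff_needed(10, 2))             # 20 with two maneuvers per station
```

On this reading, pairing maneuvers halves the number of examiners and SPs required for the same ten maneuvers.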

Reorganizing the assessment of physical examination skills
within an OSCE by pairing maneuvers within a station may
contribute to improvement in overall efficiency and provide significant
cost savings by reducing the numbers of SPs and examiners needed.
Whether this can be applied in the assessment of other clinical
skills in an OSCE requires further evaluation.

Correspondence: Dr. Christopher James Doig, Assistant Professor, Room EG2^G, Foothills
Medical Centre, 1403 29th Street NW, Calgary, AB, Canada T2N 2T9; e-mail
<cJoiy@ucalgary.ca>.
• THE EYE OF THE BEHOLDER



Moderator: Linda Distlehorst, PhD



Content, Culture, and Context: Determinants of Quality in Psychiatry Residency Programs

RACHEL YUDKOWSKY and ALAN SCHWARTZ 



Residency training programs vary across characteristics such as their 
didactic and clinical experiences, attributes of the incoming resi- 
dents, faculty characteristics, research conducted, community ser- 
vice performed by the program, and eventual practice choices of 
graduates. Which of these characteristics are most salient for
evaluating the quality of a program?

Elliott lists characteristics of graduates, cost-effectiveness, fair
and ethical treatment of trainees, and meeting societal needs as
important quality indicators. Iverson takes a dimensional
approach. His dimensions, with metrics, are: intake [U.S. Medical
Licensing Examination (USMLE) scores of matched applicants];
customer satisfaction [percentage of available positions filled by
match, and percentage filled by U.S. medical school graduates
(USMGs)]; residency review committee (RRC) quintile scores; and
outcome (specialty board pass rates). The Accreditation Council
for Graduate Medical Education (ACGME) recently switched its
focus from process variables to outcomes, and is encouraging RRCs
to evaluate a program on how well it provides for six core
competencies: patient care, clinical science, interpersonal skills and
communication, professionalism, practice-based learning and
improvement, and systems-based practice.

In 1997, a task force of the American Association of Directors
of Psychiatry Residency Training (AADPRT) developed a survey
to define the variables important to determining a program’s quality
from the psychiatry resident’s perspective. The 41-item
questionnaire was based on feedback from focus groups of psychiatry
residents and program directors and a review of the literature. A total
of 180 psychiatry residents from 16 programs completed the survey.
Quality of supervision and teaching conferences, respect of faculty
for residents, responsiveness of the program to feedback from
residents, and morale in the department were the items most important
to residents’ satisfaction. A detailed description of the construction
of the survey and its results was published by Elliott.

In 1998, the AADPRT’s survey was repeated with psychiatry
residency directors and heads of major rotations to see whether
their values agreed with those of the residents. This paper describes
the use of multidimensional scaling (MDS) of the survey’s results
to establish whether there were distinct groupings of program
directors with different opinions about the determinants of quality in
psychiatry residency programs. These groupings might represent
types of psychiatry programs (or market niches) as reflected in the
values and priorities of their faculty and directors.

Method 

Multidimensional scaling is an analytic technique frequently used
in marketing research to identify the psychological dimensions
underlying customers’ preferences with respect to multiple variables
or features of a product. In MDS the difference between clusters
or groups of variables is predicted by the distance between the
variables in psychological “space,” with the dimensionality of the
space equal to the number of relevant dimensions underlying
the data. These dimensions can be thought of as analogous to the
latent constructs derived in factor analysis. The scaling algorithm
derives the dimensions and plots the coordinates of the variables in
the resulting multidimensional space. MDS is an inherently
interpretive procedure: it locates variables on dimensions but requires
the investigator to determine whether the dimensions can be
intelligibly labeled.

Individual Differences Scaling (INDSCAL) is an MDS algorithm
that models both the overall dimensions that underlie the
perceptions of the group of respondents and individual weights on those
dimensions for each respondent, allowing individuals to vary in the
importance they attach to each of the dimensions. For example,
for one individual, dimension A (say, the educational resources
available) may be highly salient, while dimension B (say, the
administration of the program) is relatively unimportant.
For another individual, these priorities may be reversed. By
examining the distribution of subject weights one can identify
clusters of subjects who share similar values regarding the relative
importance of the various dimensions.
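
The role of the subject weights can be illustrated with the weighted Euclidean distance that underlies the INDSCAL model: each respondent sees the same item coordinates, but stretches or shrinks each dimension by a personal weight. The sketch below is illustrative only; the function name, coordinates, and weights are hypothetical and are not taken from the survey.

```python
import math


def weighted_distance(x_i, x_j, weights):
    """Distance between two items as seen by one respondent.

    x_i, x_j: coordinates of the two items in the shared group space.
    weights:  that respondent's non-negative weight on each dimension.
    """
    return math.sqrt(
        sum(w * (a - b) ** 2 for w, a, b in zip(weights, x_i, x_j))
    )


# Hypothetical coordinates of two questionnaire items on dimensions A and B.
item_1 = [0.0, 0.0]
item_2 = [3.0, 4.0]

# Respondent 1 attends to both dimensions equally; respondent 2 ignores B.
print(weighted_distance(item_1, item_2, [1.0, 1.0]))  # 5.0
print(weighted_distance(item_1, item_2, [1.0, 0.0]))  # 3.0
```

A respondent with a near-zero weight on a dimension effectively collapses it, so items that differ only on that dimension look alike to that respondent.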

The questionnaire was sent in late 1998 to all psychiatry resi- 
dency directors listed in the American Medical Association’s 1998- 
1999 Directory of Accredited Graduate Medical Education (GME) 
Programs. The faculty members who served as the heads of the
inpatient and outpatient psychiatry rotations of each program were
also surveyed. These are the two major rotations of psychiatry
training programs, and the opinions of the heads of these rotations
(henceforth referred to as service chiefs) would most likely repre- 
sent the dominant values of the program. 

The survey asked directors and service chiefs to rate how
important the 41 items of the questionnaire were in determining the
quality of a residency program. The anchors were 1 = least impor- 
tant, 4 = average importance, and 7 = most important. 

Multidimensional scaling using INDSCAL was done on the sur- 
vey responses. Solutions in two to six dimensions were generated. 

Results 

Of the 186 active programs listed in the GME directory, 117
programs (63%) responded to the survey. There were 234 individual
responses from the 117 programs. Of these, 142 (61%) were from
program directors and 92 (39%) were from service chiefs who were
not identified as directors. For some programs the head of inpatient
or outpatient services also served as an associate program director,
confirming our supposition that these faculty members represent the
administrative backbone of the program.

The Pearson correlation between the responses of the residency 
directors and those of the service chiefs was 0.98 (p < 0.01). We 
therefore pooled data from both chiefs’ and residency directors’ re- 
sponses for the following analysis. 

The two-dimensional INDSCAL solution was degenerate and
was discarded. The solutions in three to six dimensions were
inspected for goodness of fit and interpretability. The three-dimensional
configuration provided the most interpretable dimensions,
and accounted for 46.4% of the variance in the data;
higher-dimensional solutions accounted for only slightly more variance.

Based on examination of the locations of the items, particularly
those with especially high or low coordinates in each dimension,
the dimensions seemed to correspond to three constructs:
“curriculum,” “quality of the institution,” and “supportiveness of
the administration of the program.” The three dimensions, and the
highest-loading items on each, are given in Table 1.

Subject weights measure the importance or salience of each
dimension to each respondent; they range from 0 (completely
ignored) to 1 (overwhelmingly important), and need not sum to 1.
In addition, each respondent is assigned a “weirdness” value, which
measures the similarity of his or her responses to those of the typical
respondent, based on the relative importance of each dimension
and the goodness of fit for that respondent.

Table 1. Three Dimensions of Quality of Psychiatry Residency Programs,
Based on Multidimensional Scaling of Responses by Residency Directors
and Service Chiefs to a 1998 Questionnaire*

Dimension                   Questionnaire Items That Load Highly on the Dimension

Curriculum                  Quality of supervision, training in biomedical and
                            psychosocial psychiatry and the balance between them,
                            diversity of patients and settings, opportunities for
                            continuity of care, responsibility for patient care

Quality of the              Academic reputation of institution, clinical reputation
institution                 of faculty, opportunities for research and teaching,
                            board scores of graduates, job satisfaction of graduates

Supportiveness of the       Fairness in evaluation of residents, respect of faculty
program administration      towards residents, personal qualities and administrative
                            abilities of the program director, responsiveness of the
                            program to feedback from residents

* The questionnaire asked respondents to rate the importance of 41 items in
determining the quality of psychiatry residency programs. A total of 234 program
directors and service chiefs from 117 programs completed the questionnaire.

There was a great deal of variation in individual preferences, but
no distinct clusters were evident. Notably, although weights for the
supportiveness of the administration of the program dimension fell
between 0.25 and 0.40 for nearly all respondents, the importance
attributed to the dimensions of curriculum and quality of the
institution varied extensively across individuals. Figure 1 plots the
weights of curriculum and quality of the institution against one
another for each respondent. The unshaded polygon encloses data
for more typical respondents with less than the median weirdness,
represented as circles; the two shaded polygons identify two groups
of less typical respondents with more than the median weirdness,
represented as crosses. Two respondents in the upper left had
extreme (outlier) weirdness values.

While the most typical respondents gave curriculum weights
between 0.3 and 0.6, and quality of the institution weights between
0.2 and 0.4, two groups of respondents displayed different weight
patterns. One group (lower right) gave curriculum substantially
higher weights than typical (ranging from 0.5 to 0.75); the other
group (upper left) gave quality of the institution substantially
higher weights than typical (ranging from 0.25 to 0.65). On the
average, most respondents’ data displayed a continuum of weight
patterns in which curriculum was considered to be more important
than either quality of the institution or supportiveness of the
administration of the program; the respondents with the lowest
weirdness scores weighted these dimensions in the proportions 1.3:1:1,
respectively.

Discussion 

The three dimensions that emerged from the MDS are consistent
with the many suggested quality indicators reviewed above. How
might we conceptualize this triad?

The dimensions of curriculum and supportiveness reflect two
different aspects of the process of residency training. The curriculum
dimension describes the content of the educational program; the
supportiveness dimension reflects the culture or climate within
which the training occurs. Residency directors and their faculty
seem to differentiate between these two aspects of the program, and
value both as indicators of the quality of the program. The
dimension of institutional quality includes items reflecting the reputation
and resources of the institution as well as items generally considered
to be outcomes (i.e., board scores of graduates and graduates’ job
satisfaction). In this instance, these “outcome” items probably serve
as proxies for (and a reflection of) the reputation of the institution
and the quality of the residents it attracts, rather than as true
outcome measures. This dimension could represent a general context
factor, reflecting the quality of the facilities, the faculty, and the
residents themselves. This general factor could itself modify the
effects of the process variables either up or down, thereby affecting
the expected outcomes. Thus, this dimension may reflect the
expectation of program directors and chiefs that equivalent processes
(curriculum and supportiveness) could lead to better outcomes if
they are provided in the context of a higher-quality institution.

Figure 1. Subject weights for the “curriculum” and “quality of the institution”
dimensions of the three-dimensional INDSCAL solution. The unshaded polygon
encloses subjects with less than the median weirdness (based on weights in all
three dimensions), represented as circles; the two shaded polygons identify two
groups of subjects with greater than median weirdness, represented as crosses.
Two subjects in the upper left had extreme (outlier) weirdness values.
“Weirdness” measures how similar each respondent is to the typical respondent,
based on the relative importance of each dimension and the goodness of fit for
that respondent.

Donabedian lists input, process, and product as the dimensions
of quality in healthcare. Interestingly, product (outcomes) did not
emerge as a dimension of the quality of a program. Perhaps
residency directors and service chiefs focus on process variables as
indicators of quality since it is in the process of residency training
that they deal. The neglect of outcome measures may also reflect
a philosophy that, while the program is responsible for teaching, it
is the residents’ responsibility to learn. Outcome measures are
highly confounded by the abilities and characteristics of the
individual residents and, thus, may not be considered an accurate or
reliable measure of the quality of the program per se.

The ACGME and others who have begun the move towards 
outcome evaluation may wish to take this as a word of caution. 
Outcomes should not be considered in a vacuum. For at least some 
key stakeholders — the residents and faculty of the program — the 
context, content, and culture of the program are significant as well.

No truly distinct clusters or groups of respondents emerged from
the multidimensional scaling of the data. While there may be
programs with different missions (programs oriented towards research
or community psychiatry, for example), these missions do not seem
to result in drastically different definitions of quality. This suggests
that there is a core concept of quality that holds across contexts
and across missions, consistent with the RRC’s model of minimum
standards.

On the other hand, there seems to be a continuum of individual 
variation, rather than variation based on group membership. The 
individual respondents differed widely in the dimensions they con- 
sidered most important. Since the individuals in this case are the 
faculty leaders and directors of the programs, these priorities are 
most likely reflected in the programs as a whole. The individual 
variation can be usefully segregated into three “market niches,”
corresponding to the three polygons in Figure 1. Thus we could
describe three types of programs: (1) programs in which the quality
of the sponsoring institution (context) is paramount, (2) programs 
in which the quality of the curriculum (content) is paramount, and 
(3) programs with a more typical weighting of the three dimen- 
sions. 

While there was less variability in the importance attached to 
the supportiveness (culture) of the program, this should not be
construed as lack of salience. Rather, all programs should be alert 
to the importance of this dimension. 

Residents, too, vary in the levels of importance they attribute to 
the various features of a training program. The context, content,
and culture of a program may provide a good conceptual model of 
the dimensions along which the market varies. Programs may find 
it useful to identify and market themselves on the basis of these 
dimensions.

This study focused on only one of the many stakeholder groups
of residency programs. The needs and expectations of other
stakeholders, such as program funders and employers of a program’s
graduates, remain to be defined. This study also focused only on
psychiatry programs, but the dimensions of context, content, and
culture would seem to be potentially applicable to other specialties
as well. Repeating the study with other specialties will tell us
whether indeed this triad is relevant to the quality of programs
across specialties.

The method of multidimensional scaling is a novel one for de- 
termining quality measures in graduate medical education. Re- 
peated use of this technique, across stakeholders and across spe- 
cialties, can help elucidate the factors most important to the 
evaluation of residency programs. 

Correspondence: Rachel Yudkowsky, MD, Department of Medical Education, MC 591,
UIC-COM, 808 South Wood Street, Chicago, IL 60612-7309; e-mail (rachely@uic.edu).
Reprints are not available.

THE EYE OF THE BEHOLDER

Moderator: Linda Distlehorst, PhD

Gauging the Outcomes of Change in a New Medical Curriculum:
Students’ Perceptions of Progress toward Educational Goals

GREGORY MAKOUL, RAYMOND H. CURRY, and JASON A. THOMPSON

After decades of concern about the lack of momentum in reforming 
medical curricula, a number of schools have introduced significant 
revisions and innovations in recent years. In most cases, the goals 
of these changes have followed the general principles promulgated 
by the Association of American Medical Colleges’ (AAMC’s) General
Professional Education of the Physician (GPEP) and College
Preparation for Medicine Report and other similar documents.
Objectives consistent with these goals have been codified and
disseminated through the AAMC’s Medical School Objectives Project
(MSOP). Several new educational strategies (e.g., problem-based
learning) and course domains (e.g., courses in professional skills and
perspectives) have become common elements of the resulting
curricular initiatives at many medical schools.

Given the need to track the effects and effectiveness of change 
in medical education programs, Makoul developed the Student 
Perception Survey,^ which focuses on how students view both the 
learning environment and their own learning experiences. It was 
first administered at Northwestern University Medical School in 
1993, and has since been used by medical schools at the University 
of Chicago, Washington University, University of Utah, Medical
University of South Carolina and, most recently, the University of 
Minnesota at Duluth. This study limits analysis to data collected 
at Northwestern between 1993 and 1999. 

Context 

In 1993, Northwestern University Medical School implemented a 
totally new first- and second-year (M1-M2) curriculum. Other, less
sweeping, changes in the clinically oriented third- and fourth-year 
curriculum have been made more incrementally over the past de- 
cade, and are not a focus of this report. While some improvements 
have been made in our nearly seven years of experience with the 
M1-M2 curriculum, the basic concept and format are still firmly
in place. The curriculum is composed of four courses, each
presented in a series of discrete, topically focused units. Each course
and nearly every unit are interdisciplinary in nature and draw
faculty from a number of departments; all are managed and funded
centrally by the dean’s administration.

Two areas of emphasis differentiate the current M1-M2 curriculum
from its predecessor. The first is a change in the way we
expect students to learn medicine. Our students are now explicitly 
regarded as adult learners, with a wide variety of backgrounds, ap- 
titudes, and learning styles. Adult education models embrace this 
diversity and provide a framework for continuous self-directed ed- 
ucation beyond the formal curriculum. Moreover, the very nature 
of the profession demands that students learn to “think on their
feet,” relating different areas of knowledge one to another and serv- 
ing as critics of their own and others’ reasoning processes. Accord- 
ingly, the curriculum provides a variety of learning formats, with 
an emphasis on interactive, discussion-based small-group activities. 
In addition, the clinical skills units include peer observation and 
feedback on a regular basis.

The second emphasis is a dramatic increase in the attention paid 
to issues of professional perspectives and professional skills. As
detailed by Curry and Makoul, attention to students’ interpersonal
skills and attitudes and to the interface of the medical profession
with society at large had grown steadily for some years. Not until
the early 1990s, however, did schools begin to address these issues
comprehensively. Since then, professionalism has become much
more visible on the medical education agenda. The conceptual
framework of patient-centered medicine (also referred to as
relationship-centered medicine), which highly values the physician’s
capacity for empathy, attentive listening, and concern for the
patient’s perspective, has been instrumental in bringing about these
changes.

The very breadth and comprehensiveness of significant educational
reform make it difficult to reliably evaluate the specific impact
of any component. Further, consistent with the focus on adult
learning and professional development (i.e., we want our students
to mature as self-aware professionals), we consider students’
perceptions to be an important element of curriculum evaluation. We
used the Student Perception Survey as our program evaluation tool
because it offers a broad view of students’ attitudes and experiences.
For instance, we were interested in assessing, over a period of years,
whether the new M1-M2 curriculum affected students’ perceptions
about the importance of key educational goals, and whether it had
an effect on their perceived progress toward those goals.

Educational Goals: Importance. There is some concern that medical
students become less idealistic and more cynical as they progress
through the curriculum. On the other hand, students are
likely to place more emphasis on areas relevant to clinical practice
as they approach the clinical clerkship phase of their education. To
assess whether students place more or less value on key educational
goals after their first two years of medical school, we can compare
responses to the Student Perception Surveys administered to
incoming students with those to surveys administered to the same
students at the end of their second year (just before clinical
clerkships begin). Since we expect that incoming students will highly
value all of the goals, thus generating a ceiling effect, we do not
expect the importance ratings to rise. Neither do we expect them
to fall, since the new curriculum attempts to reinforce the value of
these goals. Thus, our expectations regarding importance ratings
are phrased as our first (null) hypothesis: There will be no
statistically significant difference in the importance ratings when
Student Perception Surveys administered to incoming students are
compared with those administered at the end of the second year.

Educational Goals: Progress. Attending physicians’ comments
regarding the readiness and performances of students in their
clerkships provide one good indication of whether a new M1-M2
curriculum is effective. However, it is difficult to systematically
evaluate progress toward a variety of goals with such a method.
Since we have a pass-fail grading system, the only grade-like metric
available is the U.S. Medical Licensing Examination (USMLE)
Step 1 score, also poorly suited to address a diverse set of goals.
The Student Perception Survey allows us to assess students’ views
about the extent to which the curriculum has helped them progress
toward each of the goals listed in Table 1. A brief “In Progress”
article published in Academic Medicine reported immediate positive
changes in ten of the 16 educational goals when data collected
from the class of 1996, which progressed through the first two years
before the curriculum was implemented, were compared with data
from the classes of 1997 and 1998, the first cohorts to complete
the new M1-M2 curriculum. Since we expect the revised curriculum
to prove effective in maintaining those changes, we offer the

Table 1. Responses to Importance of Educational Goals Section of the
Student Perception Survey by Incoming and Experienced Students at
Northwestern University Medical School, Classes of 1997-2001*

                                                  Incoming      Experienced
                                                  Students      Students
Educational Goal                                  Mean (SD)     Mean (SD)
----------------------------------------------------------------------------
Learn the language and information necessary      3.90 (0.30)   3.85 (0.39)
  for practicing medicine
Master skills for eliciting information from      3.84 (0.37)   3.84 (0.39)
  patients
Master physical examination skills                3.83 (0.40)   3.85 (0.39)
Become proficient in clinical decision making     3.86 (0.37)   3.76 (0.54)‡
Master skills for providing information to        3.73 (0.48)   3.71 (0.52)
  patients
Master skills for communication with              3.58 (0.55)   3.62 (0.58)
  colleagues
Learn how to manage time more effectively§        3.30 (0.77)   3.34 (0.78)
Become more aware of ethical issues in            3.38 (0.66)   3.25 (0.72)‡
  medicine
Become more proficient at learning on your        3.30 (0.79)   3.41 (0.74)†
  own
Develop skills that will enhance lifelong         3.46 (0.69)   3.51 (0.67)
  learning
Develop skills for practicing health              3.51 (0.65)   3.58 (0.61)
  promotion and disease prevention
Understand how the stresses of life as a          3.18 (0.80)   3.24 (0.82)
  physician will affect your personal life§
Identify strengths and weaknesses in your         3.51 (0.65)   3.52 (0.63)
  academic and clinical abilities
Become more comfortable when being assessed       3.09 (0.84)   3.05 (0.89)
  by your peers§
Gain a full appreciation for political,           3.18 (0.74)   3.20 (0.78)
  economic, and social influences on health
  care§
Improve your problem-solving skills               3.49 (0.65)   3.63 (0.59)‡

* Medical students completed the Student Perception Survey at the start of their first
year and again at the end of their second year. At each time point, the Educational Goals
section of the survey asked them to rate the importance of the 16 goals reproduced in
this table, using a five-point scale that runs from 0 (“not at all important”) to 4
(“absolutely essential”), with the intervening points labeled as well. Paired t-tests were
run to determine whether student perceptions of the goals changed significantly after two
years in medical school (n = 511 pairs).
† p < .01, two-tailed.
‡ p < .001, two-tailed.
§ These goals were added by faculty who developed the new curriculum. The remainder
are operationalizations of the original eight goals for medical education.


second hypothesis: Students who have progressed through the new
curriculum will report more progress toward educational goals than 
will students who completed the survey before the new curriculum 
was in place. 

Method 

Student Perception Survey. The survey gathers information about 
medical students’ perceptions regarding faculty contact, educational 
goals, educational activities, and patient-centered tasks of care. It 
also gauges learning orientation, social orientation, career plan, 
conceptions of health, and demographic information. It is admin- 
istered longitudinally via scan-form or computer: once at the be- 
ginning of medical school (i.e., during orientation week) and again 
at the end of the second year (i.e., just before clerkships). (We ran 
a study in 1997 to compare pencil-and-paper, scan-form, and
computer versions of the survey; no difference in response patterns was
detected.) This report includes data collected at both time points
from students in the classes of 1996-2001. The survey is usually
completed by all students in each cohort; it was distributed to fewer
second-year students in 1995 and 1996, and fewer incoming
students in 1998, due to administrative errors. Social security numbers
serve as identification tags, allowing us to match surveys from
incoming and experienced students without accessing their names or
creating another set of identification numbers.

Educational Goals. In 1990, the dean, with the approval of all
department chairs and senior deans, established eight goals for
medical school education. The 16 goals assessed in the Educational
Goals section of the Student Perception Survey (see Table 1) were
developed by explicating these original eight (e.g., operationalizing
“communication”) and then expanding the list to include four
additional goals expressed by faculty who had developed the new
curriculum for the first two years of medical school. Table 1
indicates which of the goals were added. Nunnally emphasized that the
plan and procedure of an item’s generation is a primary determinant
of its content validity. Drawing the items directly from goals
outlined by the medical school certainly enhanced content validity.
Further support comes from the observation that these goals are
not unique to Northwestern; they are reflected in blueprints for
medical education, deemed relevant by the other schools using
the Student Perception Survey, and evident in the expressed values
of practicing physicians. The items also have representational validity,
as pilot tests conducted during the survey-development process
indicated that medical students understood these items as intended.
The Educational Goals section of the survey asks both incoming
and experienced students to rate the importance of these 16 goals
on a scale ranging from 0 = “not at all important” to 4 = “absolutely
essential.” The intervening scale points are labeled 1 = “slightly
important,” 2 = “moderately important,” 3 = “very important.” The
survey administered at the end of the second year also asks students
to indicate the extent to which their medical school experience
has helped them progress toward each goal. The scale for measuring
progress ranges from 0 = “not at all” to 4 = “completely.”

■ Importance. To test our first (null) hypothesis, which posits little
change in how students value the various educational goals, we
performed paired t-tests on data from surveys administered to
incoming and experienced students in the classes of 1997
through 2001, all of whom had been exposed to the new
curriculum. Since we assert the null hypothesis, statistical power is an
important consideration. Simply stated, the power of a test is the
probability of rejecting the null hypothesis when it is indeed
false. Given the large sample of matched pairs (n = 511), we
chose a fairly conservative α level to avoid highlighting
differences of trivial magnitude. At α = .01 (two-tailed), we have
statistical power greater than .98 for detecting small to medium
effect sizes.

■ Progress. To test our second hypothesis, which states that the
new curriculum should be associated with greater perceptions of
progress toward the educational goals, we performed
independent-sample t-tests on data from surveys administered to
experienced students (those at the end of their second year).
(One-way ANOVAs indicated that data from the classes of 1997
through 2001 could be combined because they were statistically
similar. Thus, we ran t-tests to facilitate presentation and
interpretation of results.) We compared the perceptions of students
in the class of 1996 (n = 165), who had experienced the old
curriculum, with those of students in the classes of 1997 through
2001 (n = 603). Again, the large sample size affords good
statistical power. At α = .01 (two-tailed), we have statistical power
greater than .80 for detecting small to medium effect sizes via
these independent-sample t-tests.
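
The power statements in the two items above can be illustrated with a rough calculation. The sketch below is not the authors' procedure: it uses a normal approximation to the t distribution, and the effect size d = 0.3 is an assumed value chosen to sit between conventional "small" and "medium" effects. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(noncentrality, alpha=0.01):
    """Approximate two-tailed power using the standard normal distribution."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Probability that the test statistic lands in either rejection region.
    return (1 - z.cdf(z_crit - noncentrality)) + z.cdf(-z_crit - noncentrality)


def paired_power(d, n_pairs, alpha=0.01):
    """Power of a paired t-test; d is the standardized mean of the differences."""
    return approx_power(d * sqrt(n_pairs), alpha)


def independent_power(d, n1, n2, alpha=0.01):
    """Power of an independent-sample t-test with group sizes n1 and n2."""
    return approx_power(d * sqrt(n1 * n2 / (n1 + n2)), alpha)


# Sample sizes are taken from the text; d = 0.3 is an assumption.
print(round(paired_power(0.3, 511), 3))
print(round(independent_power(0.3, 165, 603), 3))
```

Because the independent-sample comparison has only 165 students in the smaller group, its power at a given effect size is noticeably lower than that of the paired comparison.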

Results 

Importance. On average, the students rated all of the educational
goals from “very important” to “absolutely essential” (see Table 1).
When surveys administered at the two time points were matched

Table 2. Experienced Students’ Perceived Progress toward Educational
Goals* While in the Old Curriculum (Class of 1996)
Versus the New Curriculum (Classes of 1997-2001),
Northwestern University Medical School

                                                  Old           New
                                                  Curriculum    Curriculum
                                                  (n = 165)     (n = 603)
Educational Goal                                  Mean (SD)     Mean (SD)
----------------------------------------------------------------------------
Learn the language and information necessary      2.67 (0.67)   2.86 (0.65)†
  for practicing medicine
Master skills for eliciting information from      2.80 (0.65)   2.91 (0.69)
  patients
Master physical examination skills                2.58 (0.71)   2.48 (0.76)
Become proficient in clinical decision making     2.02 (0.83)   2.29 (0.81)†
Master skills for providing information to        1.81 (0.99)   2.31 (0.90)†
  patients
Master skills for communication with              2.25 (0.91)   2.52 (0.86)†
  colleagues
Learn how to manage time more effectively‡        2.19 (0.97)   2.26 (1.04)
Become more aware of ethical issues in            2.62 (0.78)   2.88 (0.80)†
  medicine
Become more proficient at learning on your        2.60 (0.82)   2.93 (0.89)†
  own
Develop skills that will enhance lifelong         2.47 (0.80)   2.75 (0.86)†
  learning
Develop skills for practicing health              2.22 (0.78)   2.34 (0.90)
  promotion and disease prevention
Understand how the stresses of life as a          1.93 (1.03)   1.95 (1.07)
  physician will affect your personal life‡
Identify strengths and weaknesses in your         2.36 (0.84)   2.38 (0.88)
  academic and clinical abilities
Become more comfortable when being assessed       1.92 (0.89)   2.27 (0.97)†
  by your peers‡
Gain a full appreciation for political,           1.82 (0.90)   2.24 (0.93)†
  economic, and social influences on health
  care‡
Improve your problem-solving skills               2.42 (0.77)   2.74 (0.78)†

* The Educational Goals section of the Student Perception Survey, distributed to
students at the end of their second year, asked them to indicate the extent to which their
medical school experience had helped them progress toward each of the 16 goals
reproduced in this table, on a scale running from 0 (“not at all”) to 4 (“completely”).
Independent-sample t-tests were run to determine whether there were significant
differences between the perceptions of students who had experienced the old curriculum
(class of 1996) as compared with those who had experienced the new curriculum
(classes of 1997-2001).
† p < .001, two-tailed.
‡ These goals were added by faculty who developed the new curriculum. The remainder
are operationalizations of the original eight goals for medical education.


and importance ratings were compared via paired t-tests, we found
statistically significant, though relatively small, differences (Δ) in
how the students valued four educational goals. Importance ratings
increased for “become more proficient at learning on your own” (Δ
= .11, p < .01) and “improve your problem solving skills” (Δ = .14,
p < .001); they decreased for “become proficient in clinical decision
making” (Δ = -.10, p < .001) and “become more aware of ethical
issues in medicine” (Δ = -.13, p < .001).
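
The reported differences can be checked against the means printed in Table 1. The short sketch below simply subtracts the incoming mean from the experienced mean for the four goals named above; the variable and function names are illustrative, and the p values cannot be reproduced this way because a paired t-test needs the standard deviation of the within-student differences, which the table does not report.

```python
# (incoming mean, experienced mean) as printed in Table 1
table1_means = {
    "become more proficient at learning on your own": (3.30, 3.41),
    "improve your problem solving skills": (3.49, 3.63),
    "become proficient in clinical decision making": (3.86, 3.76),
    "become more aware of ethical issues in medicine": (3.38, 3.25),
}


def mean_change(incoming, experienced):
    """Change in mean importance rating, rounded to two decimals."""
    return round(experienced - incoming, 2)


changes = {goal: mean_change(*means) for goal, means in table1_means.items()}

for goal, delta in changes.items():
    print(f"{goal}: {delta:+.2f}")
```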

Progress. The students’ mean ratings of the extent to which their
experiences had helped them accomplish each goal were closer to
the scale’s mid-point than were the importance ratings (see Table
2). Students completing the new M1-M2 curriculum reported
significantly more progress toward ten of the educational goals than
did the cohort that progressed through the first two years before
the new curriculum was implemented. The biggest changes were
associated with “master skills for providing information to patients”
(Δ = .50, p < .001), “gain a full appreciation for political,
economic, and social influences on health care” (Δ = .42, p < .001),
“become more comfortable when being assessed by your peers” (Δ
= .35, p < .001), “become more proficient at learning on your own”
(Δ = .33, p < .001), and “improve your problem solving skills” (Δ
= .32, p < .001). The only decrease was associated with “master
physical examination skills” (Δ = -.10, ns).
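
As a rough illustration of how an independent-sample comparison follows from the summary statistics in Table 2, the sketch below computes a pooled-variance t statistic and a standardized effect size (Cohen's d) for “master skills for providing information to patients.” It assumes every student in each group answered the item, so the group sizes of 165 and 603 are used as the per-item n; the function name is hypothetical and this is not presented as the authors' exact computation.

```python
from math import sqrt


def pooled_t_and_d(mean1, sd1, n1, mean2, sd2, n2):
    """Pooled-variance t statistic and Cohen's d for group 2 minus group 1."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    pooled_sd = sqrt(pooled_var)
    diff = mean2 - mean1
    t = diff / (pooled_sd * sqrt(1 / n1 + 1 / n2))
    d = diff / pooled_sd
    return t, d


# Old curriculum: 1.81 (0.99), n = 165; new curriculum: 2.31 (0.90), n = 603
t_stat, cohen_d = pooled_t_and_d(1.81, 0.99, 165, 2.31, 0.90, 603)
print(f"t = {t_stat:.2f}, d = {cohen_d:.2f}")
```

Under these assumptions the largest reported change corresponds to roughly a medium standardized effect, which is consistent with the description of the importance differences as relatively small by comparison.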

Since distributions for some of the importance and progress items
were not normal, we also ran nonparametric tests (Wilcoxon
signed-ranks test for importance, Wilcoxon-Mann-Whitney test
for progress). The power efficiencies of these tests are about 95%
when compared with their parametric counterparts. We obtained
exactly the same patterns of statistical significance, reinforcing the
notion that parametric tests are robust when it comes to the
assumptions of normality.

Discussion 

A number of measures and methods (e.g., written tests, clinical
skills exams, faculty reports) can provide data for assessment of
students and curriculum evaluation. However, such data are
relatively particular in nature. Just as clinical outcomes researchers
obtain patients’ perceptions to complement more objective measures
of health, medical educators interested in the outcomes of
curricular reform have gained important information by measuring
students’ perceptions in the areas of well-being, learning activity,
learning environment, and long-term effects. This study’s
findings indicate the value of gauging students’ perceptions regarding
a variety of educational goals as well.

While there were statistically significant differences in
importance ratings for 25% of the educational goals, there was no trend
in terms of directionality. Thus, our first (null) hypothesis received
general support; the value students placed on the educational goals
remained relatively stable between orientation week and the end
of the second year of the curriculum. As shown in Table 2, our
second hypothesis, which focused on progress estimates, received
general support as well. Students who had progressed through the
new curriculum reported more progress toward ten of the
educational goals than did students who completed the survey before the
new curriculum was in place. All of the statistically significant
differences in progress estimates were larger than any of the differences
in importance ratings. This pattern of results was immediate and
has been sustained over the years. It appears that the Patient,
Physician & Society (PPS) course, which extends throughout the first
two years, contributes to increases in the students’ perceived
progress toward their educational goals. More specifically, the PPS
course emphasizes providing information to patients, incorporates
peer assessment and feedback, and explores the political, economic,
and social influences on health care.

We were pleased to find that, when compared with the students
in the old curriculum, the students who had experienced the
current M1-M2 curriculum reported more perceived progress toward
the goals of becoming more proficient at learning on their own and
developing skills to enhance lifelong learning. We attribute this
change to the adult-learner and active-learning approach taken by
all four of the M1-M2 courses. However, we did not see a similar
gain in the area of identifying strengths and weaknesses in academic
and clinical abilities, an important component of lifelong learning
and mindful practice. The results suggest that we also need to
focus our attention on helping students learn to manage time more
effectively and understand how the stresses of life as a physician
will affect their personal lives, two goals voiced by faculty who
developed the new M1-M2 curriculum.

Regarding the goal of developing skills for practicing health
promotion and disease prevention, we are planning to move to a more
clinically oriented PPS unit on health risks, in part because the
students reported little increased progress in this area at the end of
their second year. Finally, despite a well-received first-year unit on
physical examination skills in PPS, we observed a decrease in
perceived progress toward this skill set, a consistent and rather
troubling finding over the years. We will continue to work toward
improving students’ confidence and competence in physical exam
skills within the PPS course, as the first and second years of medical
school offer an opportunity to ensure a consistent approach to
teaching and learning basic skills. Our aim is to provide a solid
foundation that can be built upon during the clerkships.

While it would have been preferable to collect the Student
Perception Survey’s data for more than one cohort in the old
curriculum, the survey could not be implemented until it was designed
and tested. Still, the pattern of results is clear and consistent, and
changes in progress estimates can be logically linked to changes in
the curriculum. Further, results from other schools using the
Student Perception Survey reinforce the findings regarding progress.
For instance, progress estimates also increased at the University of
Utah after a curricular revision. Interestingly, significant progress
toward a similar number of goals was evident at both Northwestern
and Utah, but the pattern of results (i.e., mix and magnitude of
changes) differed. (We will be working with Dr. Neal Whitman
and colleagues at Utah to determine the extent to which observed
changes reflect the emphases of M1-M2 curricular reform at that
institution.) Students’ perceived progress toward their educational
goals did not increase at schools that did not make substantial
changes in their M1-M2 curricula during the period they have used
the Student Perception Survey.

Taken together, these observations highlight the generalizability
and sensitivity of this approach to curriculum evaluation. The
Student Perception Survey has proved a very useful tool for gauging
the effects of curricular reform and identifying areas in need of more
attention. We consider students’ perceptions one important
component of curriculum evaluation, and we will continue to monitor
them carefully. At present, we are working to develop a
questionnaire for residency program directors and another one for medical
school alumni, each of which will draw on aspects of the Student
Perception Survey. As noted by Gerrity and Mahaffy, this type of
outcome data serves the important function of documenting where
we have been and helping us better understand where we are going.

The authors thank Heather Sherman for her assistance in reviewing the literature 
related to this study and the RIME committee for helpful comments regarding the 
manuscript. 

Correspondence: Gregory Makoul, PhD, Associate Professor and Director, Program in
Communication & Medicine, Northwestern University Medical School, 750 North
Lake Shore Drive (ABA 625), Chicago, IL 60611; e-mail (makoul@northwestern.edu).

An Index of Students’ Satisfaction with Instruction

JAY H. SHORES, MICHAEL CLEARFIELD, and JERRY ALEXANDER

The purpose of this study was to determine whether a students’
satisfaction index (SSI), derived from responses to a single rating
of a faculty member’s overall instructional ability, is a reliable and
valid tool for identifying those medical school faculty members whose
instruction is in need of improvement.

Background 

Debates have been held since the 1950s on the validity of students’
evaluations of faculty (SEF). While opinion is split over the
application of SEF to the management of the professoriate, the
majority of the researchers addressing this issue support the use of
SEF in the areas of faculty development and instructional
improvement.

The primary premises underlying use of students’ ratings of the
instructional abilities of faculty have been repeatedly addressed by
educational researchers. The reliability and validity of such
measures have been the subject of numerous studies. While their
results have not been consistent, they have predominantly
supported the stability and validity of students’ evaluations of faculty.
The use of a single global measure to assess instructional ability has
also received the attention of the researchers, and it has been
established that such a measure can be valid.

Students evaluate every instructor at the Texas College of Os- 
teopathic Medicine. The items on the required evaluation form can 
be changed to correspond to the needs of academic departments 
and the unique characters of instructional programs. However, the 
final item on each and every evaluation form is constant across all 
courses. That item states “Overall, this is an effective instructor.” 
Responses to this question are made on a five-point scale and
reported on a 100-point scale (strongly agree equates to an SSI of 100,
agree = 80, neutral = 60, disagree = 40, and strongly disagree =
20).
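
The conversion from responses to an index can be sketched in a few lines. The point values below come from the text; treating an instructor's SSI as the simple mean of the converted responses is an assumption made for illustration, and the names are hypothetical.

```python
# Point values stated in the text for the item
# "Overall, this is an effective instructor."
SSI_POINTS = {
    "strongly agree": 100,
    "agree": 80,
    "neutral": 60,
    "disagree": 40,
    "strongly disagree": 20,
}


def satisfaction_index(responses):
    """Mean SSI for a list of response labels (assumed aggregation rule)."""
    if not responses:
        raise ValueError("at least one response is required")
    return sum(SSI_POINTS[r] for r in responses) / len(responses)


example = ["strongly agree", "agree", "agree", "neutral"]
print(satisfaction_index(example))
```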

In practice, the SSI works like grades assigned to students in 
graduate and professional training programs. Medical students seldom
use the lower half of the rating scale. Their responses result
in SSI scores from the 60s through the 90s, with a mean score of 
77. The SSI score is transmitted, along with responses to the other 
concepts assessed, to the faculty member and the department chair. 
Predictably, when an SSI drops below 70 the department’s chair 
begins to comb through other data on teaching performance to find 
out why the faculty member is not doing well. 

During the past 15 years faculty members have expressed a variety
of concerns about the SSI. Faculty members think the SSI
reflects the students’ moods at the moment and thus provides
unreliable results. They feel that students are not capable of judging
their instruction and that their peers would provide different
(presumably more favorable) results. They suggest that the students’
ratings correlate highly with the grades the students receive. Many
feel that an assessment made at the end of the course disadvantages
those who teach early. Each of these beliefs casts doubt on the
reliability and validity of the SSI. These are the questions that are
addressed in this study: (1) Is the SSI reliable? (2) Is the SSI valid?
(3) Is the SSI biased by grades? (4) Is the SSI biased by the time
lag between performance and measurement?

Method 

Eighty-five second-year medical students in a five-month course in
internal medicine agreed to evaluate quality of instruction for each
of the 124 lectures given during the course. The 24 faculty teaching
the course agreed to end their classes ten minutes early each day
to allow the students time for evaluation. The departments of
Medicine and Medical Education agreed to have faculty members
present to evaluate each lecture. As an inducement for participation,
the Department of Medical Education generated individual
formative assessments and suggestions to improve the quality of
instruction for each faculty member who taught for three or more
hours in the course. These reports were delivered after all the data
in the study had been collected.

Data were collected following each of the 124 lectures from all of
the 85 students who attended that lecture. Faculty
also evaluated each of the lectures in the course. Complete data
were collected for 24 instructors. One instructor was removed from
the study due to the onset of acute illness during instruction. The
23 remaining instructors were each assessed by ten to 310 students,
and by two to nine faculty members. One of the faculty evaluators
in each session was from the Department of Medical Education;
the rest were from teaching faculty in the Department of Medicine.
At the end of the course the normal instructor-evaluation process
was conducted and a post-course SSI was derived at that time.

Results 

Is the SSI Reliable? To answer this question a test-retest was
conducted. Students’ evaluations at the end of the course were
compared with those obtained from the same students at individual
lectures. In Figure 1, faculty have been ranked by their
end-of-course evaluations. Figure 1 presents a mean end-of-lecture SSI and
a mean end-of-course SSI for each of the 23 faculty members. The
end-of-lecture SSI is the average response of the students who
attended the lectures presented by the instructor. The end-of-course
SSI is the average response of the students who evaluated that
instructor at the end of the course. The correlation between these
two sets of satisfaction indexes is r = .847.

There is some difference between the means (end-of-lecture
mean = 83.21, end-of-course mean = 82.72). A paired t-test of the
difference between the means of the measures was not significant
(p = .647). The practical significance of observed differences
depends on their interpretation. One pair of observed differences
could cause an instructor to be viewed as a member of a different
group. Students’ satisfaction with Instructor 2 shifted from
moderate agreement to neutral. In the other 22 cases the instructors’
ratings remained in the same relative positions on the criterion
scale.
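
The test-retest comparison above rests on two statistics: a Pearson
correlation between the paired indexes and a paired t-test on their
means. A minimal sketch of both follows. The five ratings per list are
invented for illustration and are not the study's data, and the
function names are assumptions of this sketch.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def paired_t(x, y):
    """Paired t statistic and degrees of freedom for x versus y."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    sd_d = math.sqrt(sum((d - mean_d) ** 2 for d in diffs) / (n - 1))
    return mean_d / (sd_d / math.sqrt(n)), n - 1

# Hypothetical mean SSI values for five instructors (not from the study).
end_of_lecture = [90.0, 85.0, 80.0, 75.0, 70.0]
end_of_course = [88.0, 86.0, 79.0, 77.0, 68.0]

r = pearson_r(end_of_lecture, end_of_course)
t, df = paired_t(end_of_lecture, end_of_course)
print(f"r = {r:.3f}, t({df}) = {t:.3f}")
```

With the study's 23 instructors the paired t-test would have 22 degrees
of freedom; the p value would then be read from a t distribution, which
this sketch does not compute.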

Is the SSI Valid? A concurrent validation was performed.
Students’ SSI scores were compared with those derived from the
observations of the faculty members. Figure 2 presents end-of-lecture
ratings from students and faculty for each of the 23 instructors. The
students’ end-of-lecture satisfaction index is the mean response of
students who attended the lectures presented by the instructor. The
faculty’s end-of-lecture satisfaction index is the mean response from
faculty members who attended the same lectures. The correlation
between these two sets of satisfaction indexes is r = .846.

The measures had different mean values (students = 83.21,
faculty = 79.23). A paired t-test of the difference between the means
of the measures was statistically significant (p = .012). These
instructors were consistently rated almost four points lower by faculty
than they were rated by students. On a practical basis, several pairs
of these observed differences could cause an instructor to be viewed
as a member of a different group. Instructors 3, 6, 10, 15, and 19
could be viewed as less competent on the basis of their peers’
ratings.

Figure 1. Students’ ratings of 23 faculty immediately following each faculty member’s lecture (broken line) and again at the end of the course (solid line). The faculty all
delivered lectures at different points during the five-month-long internal medicine course.

Is the SSI Biased by Grades? Four non-cumulative examinations
were given in the course. The grades received on the four exams
were correlated with the average satisfaction index given by each
student during the time block covered by each exam (n = 84). The
resultant correlations were: r = -.01, .025, .089, and .090. There
was no systematic relationship between grades and SSI scores.

Is the SSI Biased by Lag Time to Evaluation? End-of-course
evaluations of the 23 instructors in this study followed their lectures
by as much as five months. To assess the effect of lag time on the
evaluations, each instructor was assigned to one of four groups
based on the number of months that passed between his or her last
lecture and the end-of-course evaluation. Table 1 presents
descriptive data for the resultant groups. By subtracting the end-of-course
satisfaction index from that obtained following their lectures,
difference scores were generated. A one-way fixed-effects analysis of
variance of the difference scores compared across groups was not
significant (p = .821).
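
The lag-time analysis above is a one-way analysis of variance on
difference scores. A minimal sketch of the F statistic follows. The
difference scores and group sizes are invented for illustration (the
study's groups held 9, 4, 4, and 6 instructors), and the function name
is an assumption of this sketch.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of scores around their own group mean.
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical difference scores (end-of-lecture minus end-of-course),
# one list per lag group. These are not the study's data.
lag_groups = [
    [1.0, -2.0, 0.5, 1.5],
    [0.0, 2.0, -1.0],
    [1.0, -1.0, 0.0],
    [-0.5, 0.5, 3.0],
]

f, df_b, df_w = one_way_anova_f(lag_groups)
print(f"F({df_b},{df_w}) = {f:.3f}")
```

A small F, as here, means the group means differ little relative to the
spread within groups, which is the pattern the non-significant result
above describes.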

Incidental Findings. There is a ceiling effect in medical students’
responses to the question that forms the basis for the satisfaction
index. As a result, for both the end-of-lecture SSI and the
end-of-course SSI there was less than one standard deviation available
above the mean response in 52% of the cases. This reflects the fact
that distributions of SSI responses were negatively skewed
(-1.289).

The behaviors of the medical students changed during the study.
They came to the lectures in far greater numbers than they had
before. In fact, a secondary issue the authors wanted to investigate
regarding the effect of attendance on performance could not be
addressed. There were too few students in the “non-attending” pool
to do an analysis.

Conclusions 

The SSI demonstrated sufficient reliability and validity in this
population of students and faculty to support its use by our institution
as a tool to identify faculty members whose instruction is
questionable. The measure did not appear to be biased by either earned
course grades or lag time to evaluation. The marked negative skew
and ceiling effect observed in the data limit the application of the
SSI to its intended purpose, flagging poorly-performing instructors.
Differentiation among the instructors at the upper end of the SSI
is practically impossible.




Figure 2. Students’ (solid line) and faculty’s (broken line) ratings of 23 instructors immediately following the lectures they delivered during a five-month-long internal
medicine course.

Table 1. Students’ End-of-course Satisfaction Indexes by Months of Lag
from Lecture to Evaluation*

Lag Time from Lecture        Satisfaction Index
to End of Course            Mean       SD     No.

<1 month                   82.33     9.64       9
2 months                   83.10    15.40       4
3 months                   84.35     3.33       4
>4 months                  81.97     5.57       6
Total                      82.72     8.68      23

*A total of 23 instructors gave 124 lectures during the five-month internal medicine
course and all were evaluated by students in an end-of-course evaluation.
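
The Total row of Table 1 can be checked from the four group rows. The
sketch below combines the group means, standard deviations, and counts
as printed in the table. It assumes the SDs are sample standard
deviations (n - 1 denominator), which the article does not state.

```python
import math

# (lag group, mean, SD, number of instructors) as printed in Table 1.
rows = [
    ("<1 month", 82.33, 9.64, 9),
    ("2 months", 83.10, 15.40, 4),
    ("3 months", 84.35, 3.33, 4),
    (">4 months", 81.97, 5.57, 6),
]

n_total = sum(n for _, _, _, n in rows)
# Weighted mean of the group means.
grand_mean = sum(mean * n for _, mean, _, n in rows) / n_total
# Total sum of squares = within-group part + between-group part.
ss_within = sum((n - 1) * sd ** 2 for _, _, sd, n in rows)
ss_between = sum(n * (mean - grand_mean) ** 2 for _, mean, _, n in rows)
total_sd = math.sqrt((ss_within + ss_between) / (n_total - 1))

print(n_total, round(grand_mean, 2), round(total_sd, 2))
```

Under that assumption the combined values agree with the Total row
(23 instructors, mean 82.72, SD 8.68).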



Discussion 

The findings of this study support the assumption that a single item
can be used to assess the global effectiveness of a faculty member’s
instruction. However, caution should be used in generalizing the
findings. The strength that the student satisfaction index has
demonstrated may have been due, in part, to its use with second-year
medical students. Medical students are highly focused, intelligent
respondents whose backgrounds are academically homogeneous.
The strength of the index is also partially attributable to the fact
that the respondents commonly used only half of the scale’s values.
This keeps the satisfaction index values for most instructors in a
relatively narrow band and gives rise to a pronounced negative
skew in their distribution.

The concurrent validity of the SSI was assessed by comparing
students’ responses with the responses of teaching faculty from the
Department of Medicine and PhD-level medical educators. The
study was conducted in a medical school that routinely evaluates
every faculty member in every course. Data were collected and
analyzed by a department that the students had come to trust for its
grading of their examinations and forwarding of their assessments
of faculty and courses to the administration. Both the students and
the faculty knew they were engaged in a study of the SSI. The
extent to which these environmental and research variables
affected the results of this study will be known only when researchers
subject the SSI to further analysis.

Finally, an ethical issue may be raised by the SSI. The SSI was
developed as a flag to inform the faculty member and department
chair that there could be a problem in the area of instruction. As
such, it does not tell the user why the respondents feel that the
instruction they have received is suspect; neither does it give
guidance to assist in remedying the problem. It is incumbent upon an
institution to develop the means for both accurately identifying the
nature of the problem and also addressing it productively
before it elects to discover which of its faculty has the problem.



Correspondence: Jay H. Shores, PhD, Department of Medical Education, University
of North Texas HSC, 3500 Camp Bowie Boulevard, Fort Worth, TX 76107.

Modeling the Effects of a Test Security Breach on a Large-scale Standardized Patient Examination
with a Sample of International Medical Graduates

ANDRE F. DE CHAMPLAIN, MARY K. MACMILLAN, MELISSA J. MARGOLIS, DANIEL J. KLASS,
ELLEN LEWIS, and SUE AHEARN



Score validity is of central concern to any organization or school
involved in high-stakes testing. Validation research entails clearly
identifying the purpose for which test scores are to be used so that
appropriate empirical evidence can be gathered to substantiate the
intended score-based inferences. The validity of these score-based
interpretations can be weakened by several test-related phenomena,
including breaches to the security of the environment. The impacts
of various forms of test security breaches need to be clearly
addressed to determine the extent to which a priori knowledge of
materials might provide an undue advantage to subgroups of
examinees. This evidence also ensures that misinterpretation of scores
is minimized on the part of the user. This task is especially crucial
with performance-based tests such as standardized patient (SP)
examinations, given the typically limited nature of case banks, the
long exposure of items/cases, and the high costs associated with
developing these types of assessments.

Impact of Security Breaches on Test Performance 

The literature devoted to assessing the impacts of various forms of
security breaches on the performances of students completing SP
tests has reported mixed findings. Most investigations undertaken
in this area have been aimed at determining whether mean scores
on SP tests vary significantly when cases are administered
throughout an extended interval, ranging from as little as several weeks
to as much as an academic year. The authors of these studies have
reported that mean station or case scores generally remain stable
and that the reuse of identical cases, consequently, appears to have
only a minimal impact on the scores of students taking the
examination at different periods of time throughout the
administration cycle. However, other research suggests that the reuse of
identical cases can yield an increase in overall mean score,
prompting a suggestion that the number of common cases be kept at a
minimum across forms. Swartz, Colliver, Cohen, and Barrows
examined whether collusion among students did affect
overall SP test scores in a more systematized fashion by encouraging
students who took the examination in the early stages of
administration to share as much information as possible about the cases
with students scheduled to be tested at a later date. The authors
found little evidence that information-sharing among students
affected performance.

It is important to underscore that those studies restricted their
view of a test security breach to various degrees of (presumed)
information-sharing among examinees. It can be argued that
complicity among students, although a common form of a test-security
breach, is probably one of its most benign manifestations. This is
especially likely with low- to moderate-stakes SP examinations,
where students’ motivation to engage in information sharing is low.
In a high-stakes context (e.g., in licensure and certification testing),
dishonest coaching organizations and examinees might employ a
host of illicit means to obtain and disseminate actual test materials.
A study undertaken by De Champlain et al. did model the impact
of additional, more severe forms of test-security breaches on
examinees’ performances, such as those that would result from
students’ having access to formal materials prior to taking the
examination. The authors reported that disclosing test materials,
whether it be directly to a subgroup of examinees or via a dishonest
coaching course, led to significant checklist performance gains for
a sample of United States medical graduates (USMGs). However,
the impact of disclosure on interpersonal skills (IPS) scores was nil.
Although informative, it is important to point out that these
findings were based on a small and homogeneous sample with respect
to examinees’ medical education and clinical skill levels. As such,
there is a need for this type of research to be replicated with a more
varied sample of examinees, to obtain an estimate of disclosure
effects that might generalize to a more heterogeneous population
of medical students.

The purpose of the present study was to model the impact of
disclosing test materials on SP examination scores with a sample
of international medical graduates. Furthermore, it is hoped that
ensuing findings will provide a practical estimate of expected effect
size within the context of this type of security breach and with this
population.

Method 

Examination. In this investigation, the SP test assessed the
clinical (history taking, physical examination, communication) skills
and IPS of physicians about to enter supervised practice. SPs are
laypeople trained to portray one of a variety of clinical scenarios.
Test candidates rotate through these scenarios (or cases) and
encounter patients in a setting intended to reflect an ambulatory care
clinic. Case-specific checklists are used to assess examinees’ clinical
skills. These checklists are composed of dichotomously scored
items, each of which represents a single action that is expected to
be done by the student. A percent-correct score, corresponding to
the number of actions done by the student out of the total number
of behaviors listed in a given checklist, is computed for all
encounters. IPS are assessed with the Patient Perception
Questionnaire (PPQ), a case-independent inventory that is composed of six
five-point Likert scale items. A percent-correct PPQ score is also
computed and reported to each student for all encounters. Both
measurement instruments are completed by the SP following each
15-minute encounter with the student. The same ten cases (chosen
from the available pool) were administered to all examinees. The
cases were selected to reflect the majority of cells contained in the
test blueprint with regard to both skill and content domains.

Scoring Procedure. In this examination, two SPs were trained to
portray each case. For any given case, the performing SP portrayed
the actual clinical scenario with the examinee, whereas the
monitoring SP observed the encounter as it proceeded on a video screen
in a separate room. Each student’s final percent-correct checklist
score reflected the consensus reached by the performing and
monitoring SPs as to what constituted the appropriate response to each
item. Videotape review was instituted to arrive at a consensus if
two or more discrepancies per checklist were noted in any given
encounter. Of the 9,625 checklist item responses recorded (77
students × 125 checklist items across the ten cases), videotape review
was necessary for 202 (2.10%). The PPQ percent-correct score was
derived from the performing SP.
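
The scoring rule above can be sketched in a few lines: compare the two
SPs' checklists, trigger a videotape review at two or more
discrepancies, and compute percent-correct from the consensus
responses. The ten-item checklists and the consensus outcome below are
invented for illustration, and the function names are assumptions of
this sketch. The last two lines reproduce the counts reported in the
text.

```python
def count_discrepancies(performing, monitoring):
    """Number of checklist items on which the two SPs disagree."""
    return sum(1 for a, b in zip(performing, monitoring) if a != b)

def needs_video_review(performing, monitoring, threshold=2):
    """Review is triggered at two or more discrepancies per checklist."""
    return count_discrepancies(performing, monitoring) >= threshold

def percent_correct(consensus):
    """Percent of checklist items credited (1 = done, 0 = not done)."""
    return 100.0 * sum(consensus) / len(consensus)

# Hypothetical ten-item checklist for one encounter (not study data).
performing = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
monitoring = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1]
# Hypothetical consensus reached after reviewing the videotape.
consensus = [1, 1, 0, 1, 0, 1, 1, 1, 1, 1]

print(needs_video_review(performing, monitoring))
print(percent_correct(consensus))

# Counts reported in the text: 77 students x 125 items, 202 reviewed.
total_responses = 77 * 125
review_rate = 100.0 * 202 / total_responses
print(total_responses, round(review_rate, 2))
```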

Examinees. Seventy-seven international medical graduates
(IMGs), recruited from the Los Angeles metropolitan area,
participated in this study and were blinded to its purpose. All examinees
were certified by the Educational Commission for Foreign Medical
Graduates, i.e., they had successfully passed the following
examinations: Step 1 and Step 2 of the United States Medical Licensing
Examination and a test of English-language proficiency. The
examinees were paid for their participation and randomly assigned to
one of two testing conditions: control or security breach (SB). The
testing environment for examinees assigned to the control
condition (n = 32) was representative of a “normal” assessment situation
(i.e., participants received routine prior information about the test
but no materials from the examination). In the SB condition, we
attempted to model a situation in which actual case materials were
disclosed. Examinees in the SB condition (n = 45) were directly
provided with the checklists for five of the ten cases to be seen
(referred to as the exposed cases) as well as the PPQ, and were
given one to two hours to review these materials prior to
completing the test. Information pertaining to the five non-exposed cases
was not disclosed to any of the examinees participating in this
study. Cases included in the exposed and non-exposed sets were
matched with respect to the main areas of this SP test’s blueprint.

Analyses. Two separate analyses of covariance (ANCOVAs)
were undertaken to compare the performances of the two groups
on the five exposed cases. For both models, the condition factor
(control or SB) was treated as the independent variable. The mean
percent-correct checklist score on the five non-exposed cases was
treated as the covariate in the first ANCOVA, while the mean
percent-correct checklist score on the five exposed cases was
deemed to be the dependent variable (DV). In the second analysis,
the mean percent-correct PPQ score on the five non-exposed cases was
deemed to be the covariate, whereas the mean percent-correct PPQ
score on the five exposed cases was treated as the DV.
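
One way to picture the analysis just described is as a comparison of
two regression models: one predicting the exposed-case score from the
covariate alone, and one that also includes the condition. The sketch
below is an illustration under that assumption and is not the authors'
software or procedure. The eight data points are invented, and the
helper names (`solve`, `ols`, `ancova`) are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(X, y):
    """Least-squares coefficients and residual sum of squares."""
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    sse = sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2
              for r, yi in zip(X, y))
    return beta, sse

def ancova(group, covariate, outcome):
    """F test for the group effect, plus covariate-adjusted group means."""
    n = len(outcome)
    full = [[1.0, g, c] for g, c in zip(group, covariate)]
    reduced = [[1.0, c] for c in covariate]
    beta, sse_full = ols(full, outcome)
    _, sse_reduced = ols(reduced, outcome)
    f = (sse_reduced - sse_full) / (sse_full / (n - 3))
    mean_cov = sum(covariate) / n
    adj_control = beta[0] + beta[2] * mean_cov
    adj_sb = adj_control + beta[1]
    return f, (1, n - 3), adj_control, adj_sb

# Hypothetical scores: group 0 = control, 1 = security breach.
group = [0, 0, 0, 0, 1, 1, 1, 1]
non_exposed = [50.0, 54.0, 58.0, 62.0, 50.0, 54.0, 58.0, 62.0]  # covariate
exposed = [51.0, 52.2, 55.4, 60.6, 57.0, 58.2, 61.4, 66.6]      # DV

f, df, adj_control, adj_sb = ancova(group, non_exposed, exposed)
print(f"F{df} = {f:.2f}, adjusted means: {adj_control:.1f} vs {adj_sb:.1f}")
```

With 77 examinees the same calculation has 1 and 74 degrees of freedom,
which is consistent with the F(1,74) values reported in the Results.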

Results 

Mean scores and standard errors on the five exposed cases for
examinees assigned to each of the two conditions, adjusted for initial
differences in ability between groups, were as follows:

For examinees assigned to the control condition,

■ the adjusted mean percent-correct checklist score was 54.53 (SE
= 1.48), and

■ the adjusted mean percent-correct Patient Perception
Questionnaire score was 60.87 (SE = 1.18).

For examinees assigned to the security breach condition,

■ the adjusted mean percent-correct checklist score was 59.95 (SE
= 1.24), and

■ the adjusted mean percent-correct Patient Perception
Questionnaire score was 67.03 (SE = 0.99).

A significant group main effect was obtained in the first
ANCOVA, F(1,74) = 7.66, p = .0071. For the exposed cases, the SB
group (adjusted M = 59.95%) significantly outperformed the
control group (adjusted M = 54.53%) on the checklist. Similarly, the
mean PPQ score for examinees assigned to the SB condition
(adjusted M = 67.03%) was significantly higher than the mean
estimated for the control group (adjusted M = 60.87%), F(1,74) =
15.84, p = .0002.

Conclusions 

Results obtained in the present study with a sample of international
medical graduates mirror those reported in previous research with
USMGs. Disclosing checklist items led to significant performance
gains for the examinees assigned to the SB condition. The gain
noted in this investigation (5.4%) was, however, slightly lower
than that obtained with a sample of USMGs. This is probably
attributable to the larger number of cases administered in the test
form (ten as opposed to six in the past USMG study). Therefore,
the challenge posed to the IMGs was slightly more daunting, as
they had to sift through ten cases to identify the clinical scenarios
for which they possessed disclosed materials and apply this
information accordingly. Nonetheless, the gain noted would concretely
translate itself into a 4.4-checklist-item disadvantage over five cases
(slightly less than one item per case). This advantage might be
inconsequential for most USMGs, who typically perform well
above the cut-score on this type of examination. However, it
could significantly affect decision consistency for IMGs, whose
scores tend to cluster in the vicinity of the pass/fail standard in a
larger proportion. The control and SB groups also did differ
significantly with respect to their mean PPQ scores, a result that was
not found with USMGs. Interestingly, the difference between the
two groups (6.2%) was actually larger than the one resulting from
disclosing checklist items. This could reflect a difference in
interaction styles that is culturally based. Disclosing simple indicators of
IPS (such as the Likert-scale items found on the PPQ) to SB group
examinees yielded a mean score that was similar to that typically
encountered with U.S. medical students. It is also worth noting
that the type of case that was most susceptible to the effects of
disclosure appears to be population-dependent. For U.S. medical
students, prior research suggested that cases involving largely
mechanical physical examination maneuvers were the easiest to
memorize and consequently reflected the highest performance gains for
those examinees with prior knowledge of materials. Divulging
materials for cases that primarily require communication and IPS in
the interaction with the patient proved to be the most beneficial
for our sample of IMGs. Again, these findings appear to be indicative
of differences in the way our sample of IMGs interacted with the
SPs. These results suggest that providing a clear description of the
examination and its goal to all examinees prior to the
administration (in some form of information bulletin, for example) is
necessary to ensure a common understanding of expected behavior on
the part of students.

In summary, the results presented in this study provide further 
evidence that the secure handling of test materials is essential for 
all examinations, whether they be traditional in format or perfor- 
mance-based. Although the security breach modeled in this inves- 
tigation was severe (half of the test materials were directly exposed 
to students), steps can nonetheless be undertaken to minimize the 
likelihood of materials being disclosed. This, in turn, might lessen 
the impact of a security breach should checklists or other pertinent 
information fall into the hands of dishonest individuals. 

One obvious strategy that should be adopted with all SP tests is
to clearly lay out the flow of materials and restrict access solely to
concerned staff so that these individuals can be held accountable
for receipt and safekeeping of this information. Delivering the
measurement instruments via a computer network also seems advisable,
given the greater control that the latter medium can afford and the
virtual elimination of a “paper trail.” The results of our study also
point out the need to increase test development efforts to minimize
the likelihood of a security breach. Increasing the pool of available
cases enables a more frequent rotation of forms within and across
test sites, thus limiting the exposure rate for any given set. Finally,
the use of modeled or cloned cases also seems desirable to increase
the size of the case pool and thwart those individuals who may
have mechanically memorized cases and accompanying materials.
Modeled cases are defined as those presenting a similar opening
scenario but requiring a different work-up on the part of the
student. Cloned cases, on the other hand, call for a similar set of
actions on the part of the student but present different contexts.

Although informative, our results need to be interpreted in light
of several limitations. First, the sample size examined was small, and
generalizations should be made with caution. Our sample was also
composed of IMGs who were perhaps atypical of the corresponding
population, given that they had successfully fulfilled several U.S.
medical licensing requirements (passed the USMLE Step 1 and Step
2 and a test of English-language proficiency). Consequently, the
effect sizes reported in this study should probably be viewed as
lower-bound estimates of what to expect in an operational testing context.
Replication of this research with different groups of both IMGs and
USMGs seems advisable. This research might also permit us to test
the hypothesis that lower-ability students might benefit more from
gaining access to materials than would those who are more proficient.
From a test-development perspective, pursuing research that focuses
on the identification of characteristics that make a case more
vulnerable to memorization would also be helpful. Finally, the findings
reported in this study underscore the need to develop methods to
detect breaches to the security of the testing environment. Research
aimed at assessing the usefulness of “tagged” checklist items and other
means should be pursued.

Testing organizations and medical schools should always be
vigilant in guarding themselves against dishonest examinees and
organizations that may wish to compromise the secure nature of the
testing environment. This investigation confirms past findings in
that the psychometric properties of the SP examination described
appear to be vulnerable to blatant disclosure of testing materials.
It is hoped that the results presented in this article will foster future
relevant research that will ultimately lead to the implementation
of secure SP tests for licensure and other purposes.

Correspondence: André F. De Champlain, PhD, Senior Psychometrician, National
Board of Medical Examiners, 3750 Market Street, Philadelphia, PA 19104.

Assessing Post-Encounter Note Documentation by Examinees in a Field Test of a Nationally
Administered Standardized Patient Test

MARY K. MACMILLAN, ELIZABETH A. FLETCHER, ANDRE F. DE CHAMPLAIN, and DANIEL J. KLASS



The large-scale standardized patient (SP) test in this study assessed 
the clinical skills of fourth-year medical students in a series of
clinical encounters targeting history taking, physical examination,
communication, and interpersonal skills. Yearly large-scale field 
tests have been undertaken over the past seven years in preparation 
for national administration. The study reported here was conducted 
in 1998. 

Students are oriented to the test prior to completing up to 12 
15-minute SP encounters (cases). Following each encounter, the 
SP records history elicited, counseling provided, or physical ex- 
amination performed using an objective checklist developed by 
expert clinicians. The checklists may be thought of as a process 
measure, serving as a reflection of actual behaviors demonstrated 
by the candidate. Interpersonal skills are assessed using the Patient
Perception Questionnaire (PPQ), a six-item instrument with a
five-point Likert rating scale (uniform for every case).

Following each encounter, students are given seven minutes to 
write a free-response Post-Encounter Note (PEN) (either a list of 
significant positive and negative history and physical findings or a 
written chart note documenting findings and counseling). The PEN 
is specifically tailored to reflect each case. There is no limit to the
number of findings students may write. Patient management
(diagnosis or therapeutic plans) and interpretation of diagnostic tests
are not assessed in these PENs. The PENs potentially reflect a
candidate’s ability to determine the most significant findings elicited
from the encounter and to accurately record them. While numerous
studies have examined the use of checklists with respect to fairness,
security, and accuracy, there is limited research investigating the
psychometric properties of PENs.

Previous studies have examined appropriate methods for scoring 
the PEN. Soliciting global judgments from experts seems appealing 
because scores are derived from the expertise of practicing
physicians, but global ratings can be unreliable unless the scoring task
is highly structured and extensive standardized training is
provided. From a national testing perspective, recruiting physicians
to score the PENs for thousands of candidates may not be feasible.
As a result, many researchers have favored the use of analytic keys
to score PENs. A significant advantage of using such scoring keys
is the fact that non-physicians can be trained to score the PENs
with an accuracy level comparable to that of physicians.

Research examining the usefulness of PENs with an SP test has 
suggested that these scores contribute valuable information to the
assessment of clinical skills by providing unique information
different from that derived from checklist scores. However, other
research indicates that the chart audit scores should not replace the
checklist entirely, since the information written by candidates in a
simulated medical record may not provide a complete picture of
events during an SP encounter.

The inclusion of the PEN in an SP test is appealing. First, it is
thought that PENs are relatively immune to within-site and
cross-site effects. Also, they do not depend on the accurate recording of
checklists by SPs. Additionally, threats to security are minimized
because the PEN is a free-response instrument and does not reveal
checklist content or other exam material.

However, before the PEN can be used in large-scale testing, it is
important to determine whether the PEN is a reflection of the
checklist or whether the PEN contributes unique information about
a student’s ability to synthesize and record medical information.
The purpose of this study was, therefore, to investigate the rela- 
tionship between entries recorded in the PEN and actions captured 
on the checklist. It is hoped that the results of this study will help 
determine how to best incorporate PEN information into a com- 
posite score. 

Methods 

Measurement instruments. History taking, physical examination, 
and communication skills in this SP test were assessed using six 
case-specific checklists composed of dichotomously scored items.
Each checklist focused on two of the three clinical skills and con- 
tained a maximum of 25 items deemed critical for that case by a 
panel of expert clinicians. The six cases used in the study were 
based on a test blueprint reflecting the challenges expected to be 
encountered by a medical student entering residency. Checklists 
and PPQs were completed by the SPs following each 15-minute 
clinical encounter. 

An SP monitor viewed the encounter on a video screen and 
completed the same checklist as the encounter unfolded. If two or 
more item responses on the checklist completed by the SP con- 
flicted with the responses on the checklist provided by the SP ob- 
server, a review of the videotaped encounter was required. The SP 
and the SP monitor reviewed the videotape together and discussed 
what constituted a correct response. The agreed-upon response be- 
came the key. 

A panel of expert clinicians was selected from a range of medical 
specialties. Each member of the panel had experience teaching 
third- or fourth-year medical students and experience with devel- 
oping or implementing SP exams for clerkships or other courses. 
The panel developed analytic scoring keys for the PENs of the six 
study cases by using checklists, guides to the checklists, case
prescriptions, and videotapes of each clinical encounter. These keys
contained a list of significant findings or acceptable synonyms that 
the clinicians deemed essential for inclusion in a patient note. 

Examinees. The sample was composed of 80 fourth-year medical
students from one northeastern medical school. The students were
required to participate in the examination, but the scores did not
contribute to their end-of-year grades.

A group of six medical chart abstractors (MCAs) was recruited 
to score the PENs of the 80 students. The MCAs were divided into 
three pairs and each pair was assigned to score the PENs for two 
cases. The MCAs were oriented to each of their assigned cases by 
listening to the case summary and student instructions and by
watching a videotaped encounter of the case. The MCAs then
scored an initial sample of PENs for each case using the keys
developed by the panel. The MCAs were instructed to give credit for
each finding (or acceptable synonym) included in the PEN.
Following this “practice scoring exercise,” the pairs discussed their
decisions and reached consensus. The MCAs then scored the
remaining PENs and completed a post-scoring survey regarding their
reactions to the process.

Analyses. Items on the scoring key were matched to content-equivalent
items on the checklist by a test-development staff professional.
Items on the scoring key without a matching checklist
item and exemplars were not reviewed in the study. For each
matched scoring key and checklist item, the proportion-of-agreement
rates (the observed agreement rate between a given behavior
as measured by the checklist and the PEN) as well as kappa
coefficients (an estimate of agreement above and beyond that expected
due to chance) were computed.
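
The two statistics named above can be sketched for a single matched item as follows. This is a minimal illustration only, assuming each student's result is coded 1 (credited) or 0 (not credited) in both the checklist and the PEN; the function name `agreement_and_kappa` and the sample data are hypothetical and are not taken from the study.

```python
def agreement_and_kappa(checklist, pen):
    """Return (proportion of agreement, Cohen's kappa) for one matched item.

    checklist and pen are equal-length lists of 0/1 scores, one per student.
    Kappa is returned as None when chance agreement equals 1, that is, when
    every student is credited (or not credited) by both sources.
    """
    if len(checklist) != len(pen) or not checklist:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(checklist)

    # Observed agreement: share of students scored identically by both sources.
    p_obs = sum(c == p for c, p in zip(checklist, pen)) / n

    # Agreement expected by chance, from each source's marginal "yes" rate.
    c_yes = sum(checklist) / n
    p_yes = sum(pen) / n
    p_exp = c_yes * p_yes + (1 - c_yes) * (1 - p_yes)

    if p_exp == 1.0:
        return p_obs, None  # kappa is undefined (division by zero)
    return p_obs, (p_obs - p_exp) / (1 - p_exp)


# Hypothetical scores for ten students on one matched item.
checklist_scores = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
pen_scores = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]
p_obs, kappa = agreement_and_kappa(checklist_scores, pen_scores)
```

With these made-up scores the observed agreement is .80 while kappa is about .60, which illustrates how correcting for chance can lower an apparently strong agreement rate.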

Results 

The frequency distributions of the proportion-of-agreement rates
and kappa coefficients for the six cases are shown in Table 1. A
total of 68 item comparisons were examined across six cases. Overall,
69% of the items had proportion-of-agreement rates that fell
between .61 and 1.0. At the case level, 75% of the items in Case
1, 92% of the items in Case 2, 69% of the items in Case 3, 56%
of the items in Case 4, 100% of the items in Case 5, and 36% of
the items in Case 6 had proportion-of-agreement rates that fell
between .61 and 1.0. Across the six cases, 31% of the items had
rates that fell between .21 and .60, and none of the items had a
proportion-of-agreement rate below .20. Kappa values ranged from
-.12 (Case 6) to .90 (Case 3). Kappa cannot be computed for
items that are correctly answered by all examinees; hence, the total
number of item comparisons for kappa across all six cases was 62. Using
classification guidelines proposed by Landis and Koch, perfect or
excellent agreement was achieved for 15% of the items, good or fair
agreement for 42%, and slight or poor agreement for 44%.
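
The overall percentages quoted above can be checked against the counts in Table 1. The sketch below simply pools the per-case counts for each band and converts them to rounded percentages; the names `AGREEMENT_COUNTS`, `KAPPA_COUNTS`, and `band_percentages` are hypothetical, and the pairing of the three Landis and Koch labels with the three table bands is inferred from the matching percentages rather than stated in the text.

```python
# Item counts per band, transcribed from Table 1.
# Bands, in order: (0 to .20, .21 to .60, .61 to 1.0); keys are case numbers.
AGREEMENT_COUNTS = {1: (0, 2, 6), 2: (0, 1, 12), 3: (0, 4, 9),
                    4: (0, 7, 9), 5: (0, 0, 7), 6: (0, 7, 4)}
KAPPA_COUNTS = {1: (2, 4, 0), 2: (2, 7, 3), 3: (4, 5, 3),
                4: (10, 4, 2), 5: (1, 3, 1), 6: (8, 3, 0)}


def band_percentages(counts_by_case):
    """Return (total items, rounded percentage of items in each band)."""
    # zip(*...) regroups the per-case tuples into one tuple per band.
    band_totals = [sum(band) for band in zip(*counts_by_case.values())]
    total = sum(band_totals)
    return total, [round(100 * t / total) for t in band_totals]


agreement_total, agreement_pct = band_percentages(AGREEMENT_COUNTS)
kappa_total, kappa_pct = band_percentages(KAPPA_COUNTS)
```

Pooling the counts this way gives 68 items with 0%, 31%, and 69% in the three agreement bands, and 62 items with 44%, 42%, and 15% in the three kappa bands, which matches the figures reported in the text.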

The average proportions of discordance between items on the
checklists and the PENs are provided in Table 2. Column three in
the table represents the average proportion of students who received
credit for an item during the encounter (checklist = Y, or
yes) but did not write the findings in the note (PEN = N, or no;
errors of omission). At the skill level, the average proportions
ranged from .08 to .31 for history taking, .09 to .33 for physical
examination, and .09 to .39 for communication skills. Poor
documentation in the note does not appear to have been skill-specific.
Column five shows the average proportions of students who recorded
findings in the note (PEN = Y; errors of commission)
that were not actually pursued in the encounter (checklist = N).
With the exception of that for the history-taking skill in Case 6,
these average proportions are considerably less than those in column
three.
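
The two kinds of discordance described above can be sketched as follows. This is an illustrative reading, not the study's documented procedure: it assumes 0/1 coding per student, that each proportion is taken over all students who took the case, and that the table entries are simple means over the matched items within a case and skill. The function names and the sample data are hypothetical.

```python
def discordance(checklist, pen):
    """Return (omission rate, commission rate) for one matched item.

    Omission:   checklist = 1 but PEN = 0 (done, but not written down).
    Commission: checklist = 0 but PEN = 1 (written down, but not done).
    """
    if len(checklist) != len(pen) or not checklist:
        raise ValueError("need two non-empty score lists of equal length")
    n = len(checklist)
    omission = sum(c == 1 and p == 0 for c, p in zip(checklist, pen)) / n
    commission = sum(c == 0 and p == 1 for c, p in zip(checklist, pen)) / n
    return omission, commission


def average_discordance(items):
    """Average the two rates over a list of (checklist, pen) item pairs."""
    rates = [discordance(c, p) for c, p in items]
    k = len(rates)
    return (sum(o for o, _ in rates) / k, sum(c for _, c in rates) / k)


# Two hypothetical history-taking items scored for four students.
history_items = [
    ([1, 1, 0, 1], [1, 0, 0, 1]),  # one omission
    ([1, 0, 1, 1], [0, 1, 1, 1]),  # one omission, one commission
]
avg_omission, avg_commission = average_discordance(history_items)
```

For the made-up items above, the average omission rate is .25 and the average commission rate is .125, the same pattern of omissions outnumbering commissions that Table 2 shows for most case and skill combinations.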

Discussion 

Proportion-of-agreement rates uncorrected for chance were strong
for four of the six cases. Cases 4 and 6 exhibited moderate agreement.
The proportion-of-agreement rates corrected for chance were
poor to moderate. Cases 4 and 6, for which agreement was particularly
poor, required that the student use appropriate patient education
and counseling techniques and reassure the patient. Thus,
students appeared to have difficulty synthesizing and appropriately
recording the psychosocial elements of an encounter. It may be that
the students simply did not recognize the pertinence of documenting
information related to patient education and counseling. Other
studies have also reported poor documentation of items related to
patient education.

Table 1. Frequency Distribution of Proportion-of-agreement Rates (and
Kappa Coefficients) between Checklist and Post-Encounter Note (PEN) Items
in a Field Test of an Administered Standardized Patient Test, 1998*

            Proportion-of-agreement Rates          Kappa Coefficients
            0 to .20   .21 to .60   .61 to 1.0     0 to .20   .21 to .60   .61 to 1.0
Case 1         0           2            6             2           4            0
Case 2         0           1           12             2           7            3
Case 3         0           4            9             4           5            3
Case 4         0           7            9            10           4            2
Case 5         0           0            7             1           3            1
Case 6         0           7            4             8           3            0

* Eighty fourth-year medical students at a northeastern medical school took the SP
examination of six stations (15 minutes each). The examination was required but the
score did not contribute to the end-of-year grade.

Table 2. Average Proportions of Checklist-Post-Encounter Note (PEN)
Discordance, by Case and Skill, of Examinees’ Performances on a Field
Test of a Nationally Administered Standardized Patient Test, 1998*

                                        Average Proportion
                      No. of Item       Checklist = Yes   Checklist = No
Skill                 Comparisons       PEN = No          PEN = Yes

Case 1
  History                  3                 .22               .00
  Physical exam            5                 .32               .03
Case 2
  History                 10                 .16               .08
  Physical exam            3                 .09               .01
Case 3
  History                  5                 .08               .01
  Physical exam            8                 .33               .05
Case 4
  History                  8                 .31               .03
  Communication            8                 .23               .12
Case 5
  History                  5                 .15               .01
  Communication            2                 .09               .01
Case 6
  History                  7                 .19               .23
  Communication            4                 .39               .01

* Eighty fourth-year medical students at a northeastern medical school took the SP
examination of six stations (15 minutes each). The examination was required but the
score did not contribute to the end-of-year grade.

On the whole, however, poor documentation in the PEN was
not confined to cases with communication components. Omitting
from the PEN items that were actually pursued during the encounter
appeared to some degree in every case. In particular, the physical
examination items in Cases 1 and 3 were not documented well,
suggesting that the students had difficulty translating physical
exam findings into the written word. Again, it may be that the
students simply did not understand the significance of the information
gleaned from the encounter. It is also possible that the
students’ concept of an adequate note was less comprehensive than
that envisioned by the test developers. Revising the student orientation
to the test with an emphasis on the importance of the
PEN and providing examples of appropriate documentation for a
clinical encounter may improve documentation rates.

Although it was less likely that the students recorded in the PEN
findings that were not actually pursued during the encounter, such
items occurred more often in Cases 2 and 6 for history taking, and
in Case 4 for communication items. At first glance it appeared that
the students might have fabricated findings, but closer examination
of the items where this occurred revealed potential problems with
the wording of the scoring key. For example, while a checklist item
is very specific, a single PEN item attempts to capture all possible
synonyms; thus, overlapping concepts may be inadvertently lumped
together. Identifying the discrepancies between the checklist and
the PEN may be helpful for refining the scoring key.

It is important to interpret our findings in light of limitations
inherent in the study. First, the analyses were conducted with a
small sample (80 students per case) from a single medical school;
thus, generalizations should be tentative. Second, the study examined
only six cases, which provided a limited opportunity to
sample across the test blueprint. Future research should address
these limitations by expanding the focus of the study to several
diverse schools. Increasing the number of cases and the sample size
will also help to minimize measurement error. Finally, excluding
items from the scoring key without a matching checklist item may
have underestimated the students’ abilities to document their findings.
Within this study, the PEN scoring key was assumed to represent
the “gold standard” for appropriate documentation. Students
may have written in the notes items that would not be credited
because the clinical experts did not deem them critical. Further
content validation of the scoring key may be useful in addressing
this shortcoming.

Despite these limitations, the results of this study suggest that 
the PEN provides unique information about students’ abilities to
document the gathering of information, understand the significance 
of the information gathered, and translate verbal information into 
the written word. Thus, poor concordance between the checklist 
and the PEN suggests that students may have limited skills to prop- 
erly synthesize, interpret, and record findings from a clinical en- 
counter. Given these results, test developers may be less inclined 
to treat the PEN score as simply a reflection of the checklist or 
redundant information, in view of the fact that the ability to record 
significant findings from clinical encounters and demonstrate un- 
derstanding of those findings is a critical function not only of med- 
ical students but also of practicing physicians. 
• LICENSED TO PRACTICE



Moderator: Dale Dauphinee, MD 



International Medical Graduates’ Performances of Techniques of Physical Examination, with a
Comparison of U.S. Citizens and Non-U.S. Citizens

STEVEN J. PEITZMAN, DANETTE MCKINLEY, MICHAEL CURTIS, WILLIAM BURDICK, and GERALD WHELAN 



Literature dating back over 25 years has documented and commented
upon deficiencies in the performances of medical students
and house officers in both techniques of physical examination and
ability to detect abnormalities. Many of these studies took place
in U.S. teaching hospitals, and when the subjects were residents,
the studies do not specify results for international medical graduates
(IMGs). Yet in 1998-1999, 25% of first-year house officers in U.S.
postgraduate medical training programs were IMGs, whose clinical
experience in medical school is considered more variable than that
offered in U.S. and Canadian schools. A substantial number of
IMGs are actually American and Canadian citizens acquiring their
undergraduate medical training outside North America, particularly
at the “offshore” schools, which have proliferated in the Caribbean.
The quality of training in clinical skills in these new schools, which
are not accredited by the Liaison Committee on Medical Education,
is largely unknown. In July 1998 the Educational Commission for
Foreign Medical Graduates (ECFMG) implemented its Clinical
Skills Assessment (CSA®), a new requirement for ECFMG certification.
Experience with a more recently created “physical examination
case” within the CSA has allowed us to measure skills in
a selection of basic physical examination techniques among IMGs
completing this high-stakes performance assessment, including
non-U.S. citizens and U.S. citizens. Both groups were deficient in
important skills.

Methods 

Test Case and Design. The CSA is a ten-station performance assessment
using standardized patients (SPs). It is designed to measure
capabilities in history taking, certain aspects of physical examination,
oral and written communications, interpersonal behaviors,
and the English language. The typical case requires the examinee
to assess a new patient problem by taking a focused history and
performing what the examinee considers a relevant physical examination.
The examinee completes a “patient note” and suggests
a differential diagnosis and a diagnostic plan. The SPs use checklists
to document which expected elements of history taking and physical
diagnosis the candidate did, but the format cannot always distinguish
whether an examinee omitted a physical examination element
or attempted it but made an error. For this reason, the
physician staff of the CSA, with the endorsement of its Test Development
Committee, created a “physical examination case.” The
case scenario presents a young man who needs a pre-employment
physical examination; the “patient” hands to the examinee a simulated
examination form that explicitly indicates the physical examination
components to be done (listed in Table 1). These elements
were chosen not to replicate an entirely realistic
pre-employment examination, but rather to include tasks relating
to a variety of organ systems and to include some for which correct
technique would likely be especially necessary for detecting abnormalities
in actual practice.

Each task, such as “auscultation of lungs,” “ophthalmoscopic examination,”
or “deep tendon reflexes,” was broken down into one to
four components or scoring criteria, each scored by the SP as
“done” or “not done” by the examinee. For example, for ophthalmoscopic
examination, a candidate can separately obtain a point
for correctly instructing the patient, for using his or her right eye
for patient’s right eye and left for left, and for bringing the instrument
sufficiently close to the patient’s eye. Such criteria were
largely based on techniques outlined in a standard textbook on
physical examination that is used extensively both in the United
States and in other countries. Criteria for auscultation of the heart
were minimal because we could not fairly expect special positioning
and maneuvers in the setting of a screening examination for a
young adult without complaints. More criteria might have been
entered for some tasks, but since the CSA intends to assess basic
clinical skills, we aimed at the most essential techniques. Also, we
did not want to impair SP recall with too long a checklist.



Table 1. Performances of Non-U.S.-Citizen and U.S.-Citizen International Medical Graduates (IMGs) of Selected Physical Examination Techniques
Tested in One Case within the Clinical Skills Assessment (CSA®) of the Educational Commission for Foreign Medical Graduates,
October 1999-January 2000

                                                    Mean Score, % (95% Confidence Limits)
                                   No. of Scoring   All Examinees   Non-U.S. IMGs   U.S. IMGs
Physical Examination Task          Criteria*        (n = 318)       (n = 247)       (n = 71)

Measure blood pressure                  3           87 (83-92)      84 (80-88)      91 (83-98)
Assess extraocular movements            1           83 (78-89)      75 (70-80)      92 (82-100)†
Ophthalmoscopic examination             3           70 (65-74)      60 (55-64)      80 (72-88)†
Percussion of lungs                     2           76 (71-81)      76 (71-80)      76 (67-85)
Auscultation of lungs                   3           91 (88-94)      90 (87-92)      93 (88-98)
Auscultation of heart                   1           99 (97-100)     99 (97-100)     99 (96-100)
Radial and dorsalis pedis pulses        2           80 (75-85)      72 (67-77)      88 (79-96)†
Deep tendon reflexes                    4           80 (76-84)      74 (70-78)      86 (78-93)†
Whole case                             19           79 (77-81)      77 (74-79)      87 (83-90)†

* “Scoring criteria” are components of the physical examination task. Mean scores were averages over the groups of the proportions of criteria correctly met by examinees.
† Difference in means between non-U.S. and U.S. IMGs is significant, p < 0.01.

The case was carried out by one SP following intensive training.
One author (SJP) validated the SP’s accuracy using simultaneous
checklist scoring during “pilot” runs of the case. The SP was already
considered by staff a rapid learner and accurate in his work, and
showed good scoring concurrence during quality-assurance observations
in this and another case (less than 10% discrepancy). He
has no physical abnormalities.

The case is not used in every administration of the test. It is one
of a group of “miscellaneous” cases chosen by a computerized selection
program designed to achieve balanced forms while accommodating
the availability of SPs. From the perspective of any one
candidate, the appearance of this case on her or his ten-case form
was effectively random: the candidates were in no way prospectively
selected. We report on the first 318 candidates who encountered
this case on their form (247 non-U.S. citizens and 71 U.S. citizens)
from October 1999 through January 2000. The ratio of U.S. to
non-U.S. citizens (.29) in this group turned out to be somewhat
lower than the ratio (.44) for all 8,313 candidates tested by the
CSA as of the date of analysis. The overall test scores in data
gathering (history taking and physical examination) across all ten
cases in their forms for the 318 examinees in this study were similar
to those for all candidates (t = 1.732, p = 0.08), suggesting that
our cohort was representative.

Analysis. For a task, such as “deep tendon reflexes,” comprising
four scoring criteria, an examinee could obtain 0, 1, 2, 3, or 4
points, expressed for each task as a percent-correct score of 0%,
25%, 50%, 75%, or 100%, with a similar transformation used for
tasks comprising fewer subtasks. We calculated the mean of these
percent-correct scores for each task (e.g., deep tendon reflexes), and
for the whole case (percentage of all 19 criteria done correctly),
over all examinees in the cohort, and did the same for the subgroups
of U.S. IMGs and non-U.S. IMGs. Confidence intervals
were also calculated. To compare the performances of non-U.S.
IMGs and U.S. IMGs, we conducted a repeated-measures analysis
of variance. The eight physical examination tasks were the within-subject
factors, and citizenship at start of medical school was the
between-subjects factor. Post-hoc analyses were conducted to determine
whether differences in task scores between groups were
significant.
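
The percent-correct transformation and group means described above can be sketched as follows. The text does not say how the confidence intervals were computed, so the normal-approximation interval used here (mean plus or minus 1.96 standard errors) is an assumption, as are the function names and the sample points; none of this is the authors' analysis code.

```python
from math import sqrt
from statistics import mean, stdev


def percent_correct(points, n_criteria):
    """Convert criteria met on one task into a percent-correct score."""
    if not 0 <= points <= n_criteria:
        raise ValueError("points must lie between 0 and n_criteria")
    return 100.0 * points / n_criteria


def mean_with_ci(scores, z=1.96):
    """Return (mean, lower, upper) using a normal-approximation interval.

    Assumption: the interval is mean +/- z * standard error; the source
    does not state which method was actually used.
    """
    m = mean(scores)
    half_width = z * stdev(scores) / sqrt(len(scores))
    return m, m - half_width, m + half_width


# Hypothetical points earned by five examinees on a four-criterion task
# such as deep tendon reflexes.
points_earned = [4, 3, 2, 4, 3]
scores = [percent_correct(p, 4) for p in points_earned]
task_mean, lower, upper = mean_with_ci(scores)
```

With a sample this small a t-based interval would be the more usual choice; the normal approximation is shown only because it keeps the arithmetic transparent for cohorts of the size reported here.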

To better understand qualitatively the nature of frequently scored 
errors and omissions, the author (SJP) most responsible for design- 
ing the case and training the SP interviewed the SP and observed 
40 randomly selected tapes of actual encounters. The SP was asked, 
where appropriate, to recall the most common errors causing him 
to withhold a mark for a given scoring criterion (e.g., “palpating 
too high on the foot” for dorsalis pedis pulse). 

Results 

Table 1 shows the mean percentage scores for each task and for the
whole case for all examinees in the cohort and for the U.S. and
non-U.S. subgroups. The task main effect was significant (F =
22.631, p < .01), indicating that the tasks, averaged over the two
groups, were not of equal difficulty. The weakest performance was
in ophthalmoscopy, the strongest in cardiac examination (for
which, as mentioned, the criteria were minimal). There was a significant
group (between-subjects) effect (F = 14.325, p < .01), indicating
that, averaged over the eight tasks (or the whole case),
there was a statistically significant difference in scores between the
two groups. The U.S. IMGs obtained significantly higher case
scores than did the non-U.S. IMGs.

The group-by-task interaction was also significant (F = 4.126, p
< .01), indicating that differences in performances between groups
varied over the eight tasks. That is, the U.S. IMGs performed significantly
better than did the non-U.S. IMGs for extraocular movements,
ophthalmoscopic examination, locating radial and dorsalis
pedis pulses, and deep tendon reflexes.




Analysis of the scores for each scoring criterion within the eight
physical examination tasks (not presented here), a “debriefing” interview
with the SP, and review of a sample of videotapes of encounters
revealed the following common technical deficiencies:
clumsiness in properly placing and wrapping the blood pressure cuff;
insufficient extent of induction of motion in testing eye movement;
failure to use “right eye for right eye and left eye for left eye” and
not bringing the instrument in closely enough for ophthalmoscopic
examination; not comparing right with left at a given location on
the thorax for pulmonary percussion; unfamiliarity with the location
of the dorsalis pedis pulse; lack of briskness in applying the
reflex hammer and applications at incorrect locations.

Discussion 

Our study looked only at proficiencies in some fundamental techniques
of physical examination, not the ability to recognize and
interpret abnormalities. Thus, performance levels below 90% of criteria
met can be considered a cause for some alarm when observed
in medical school graduates, or final-year students, intending to
enter a postgraduate training program. While few prescribed and
traditional techniques in physical diagnosis have been rigorously
tested to determine whether they improve accuracy in detecting or
excluding abnormalities, we used as criteria well-established methods
advocated in the most widely used textbook of physical diagnosis.
Furthermore, it is difficult to deny that little will be seen in
the fundus by an examiner holding the instrument 10 inches from
the eye, or that a meaningful interpretation of the deep tendon
reflexes is unlikely to follow misapplication of the hammer. It is
not too much to expect that every new house officer on the first
day of residency would be able to effortlessly and rapidly apply and
use the sphygmomanometer in an urgent situation, yet our cohort
of IMGs showed only an 87% level of proficiency in this skill. Of
interest, McKay et al. tested Canadian medical graduates and found
deficiencies in the technique of blood pressure measurement,
though they used a more stringent set of criteria than ours.

The ophthalmoscopic examination warrants comment. Non-U.S.
IMGs showed only a 60% and U.S. IMGs an 80% level of
proficiency, a significant difference but a low score for both. Recent
literature and the observations of one of us (SJP) at the medical
school where he teaches suggest a declining use of the ophthalmoscope
among learners and teachers in American academic medicine.
Our results in this study hint that the situation is similar
elsewhere. McNaught and Pearson, in the United Kingdom, found
that ownership of an ophthalmoscope declined sharply after an
“equipment grant” was discontinued. While any conception of
the core skills in physical diagnosis must evolve to match changing
patterns of practice, arguably all general physicians and some non-ophthalmologic
specialists should be able to recognize at least papilledema,
the advanced optic cupping of glaucoma, and perhaps
some of the findings associated with common diseases such as diabetes
and hypertension. Faulty basic technique, as evidenced by
the IMGs we tested, will both frustrate those trying to master this
difficult element of physical diagnosis and impede accuracy.

Why might non-U.S. IMGs have performed less well in some
tasks than U.S. IMGs? The ECFMG elected to create and implement
the CSA based in part on the belief that clinical instruction
among international medical schools is less standardized and more
variable in extent than that offered by U.S. and Canadian schools
accredited by the Liaison Committee on Medical Education. A
majority of U.S. IMGs taking the CSA have attended one of the
“offshore” medical schools. Students in these schools do much of
their third- and fourth-year clinical rotations in U.S. hospitals and
practices, and so may encounter the sorts of physical diagnosis expectations
tested for in the CSA. Candidates have the opportunity
to try out the physical examination equipment available in our
examination rooms before the examination begins. Staff have on
several occasions heard non-U.S. IMGs report that they had never
used an ophthalmoscope or (more rarely) had seldom performed a
blood pressure measurement. We are not aware, however, of any
comparison of preliminary clinical skills instruction among U.S./Canadian,
“offshore,” and other international medical schools.

We do not believe that the SP performing this case showed bias 
in favor of U.S. IMGs over non-U.S. candidates. Obviously our
training program for SPs includes discussion of bias and the im- 
perative to avoid it. Also, by chance, the SP chosen for this case 
is himself a native of another country and speaks with an accent. 
Furthermore, our observations of a sample of encounters seemed to 
confirm the differences detected. 

Our study has limitations. It provides no comparison of skills of
IMGs with those of graduates of U.S. and Canadian schools, and
we by no means intend to imply that the latter would not show
some deficiencies; indeed, literature cited earlier suggests that they
would. We were not able to assess all commonly used physical examination
tasks, and such skills as rectal, pelvic, and breast examination
are not incorporated into the CSA. As noted, our physical
examination case yields little information on ability to carry
out a thorough cardiac examination appropriate to a symptomatic
patient. Observations of videotapes revealed that occasional candidates
did not attend to the explicit instructions for the case and
failed to attempt one or more tasks, though we do not think the
resulting invalid scores would influence the overall results and conclusions.

This study has several implications. First, residency program directors
should be aware that some medical graduates entering their
programs might not bring with them a full repertoire of fundamental
skills in physical examination technique; of course, our results
apply only to graduates of medical schools outside the United
States and Canada. It therefore may be desirable to assess selected
clinical skills early in the first year and provide focused remediation
for detected errors. Second, those responsible for clinical skills
instruction at the medical school level may need to also sharpen their
focus on ensuring the acquisition of fundamental physical diagnosis
methods before students graduate. Finally, the authors’ experience
with this station supports the now widely accepted view that well-trained
standardized patients can be used to assess ability in at least
rudimentary techniques of physical examination.

Correspondence and requests for reprints: Steven Peitzman, MD, ECFMG, 3624 Market
Street, Philadelphia, PA 19104; e-mail <speitzman@ecfmg.org>.
• PREPARING TO MAKE THE GRADE



Moderator: Lynn Epstein, MD



Comparison of Three Parallel, Basic Science Pathways in the Same Medical College 

DAVID R WAY, ANDY HUDSON, and BRUCE BIAGI



Since 1970, the Ohio State University College of Medicine and 
Public Health has offered medical students a choice between two 
basic science pathways, lecture discussion (LD) and independent 
study (IS). Since 1991 the college has offered entering students a 
choice among three pathways, LD, IS and problem-based learning 
(PBL). Most of the literature on implementing alternative basic 
science curricula has focused on the comparison of USMLE Step 
1 test scores between different curricular methods. The purpose of 
this study was to investigate outcome measures (other than USMLE 
test scores) such as student activities and achievement in clinical 
education, and affective measures of student and faculty satisfac- 
tion. Additionally, we sought to assess the effect of pathway choice 
on admission, and to determine the factors influential in determin- 
ing student pathway choice. 

Ours is the only medical school in the country where entering
students have a choice of three preclinical pathways, making it 
fertile ground for comparison of the effects of different curricula. 
Learning objectives, content material, and structure (organ-based 
organization) are very similar across all three pathways. The three 
also share faculty, staff, and administrative oversight. What differs 
across pathways are the teaching and learning methods. 

In 1997-98 the college formed a task force to study the benefits 
and overall desirability of maintaining the three preclinical path- 
ways. Specifically, the task force was charged to look at all three 
pathways in terms of their educational importance, student and
faculty preferences, and participant satisfaction. 

Until recently, the traditional LD was the most commonly chosen
pathway among the 210 matriculating students each year. The
primary mode of teaching in this pathway is large-group lecture
supplemented with small-group discussions and labs. The IS path- 
way, established in 1970 as the first alternative to the LD, offers 
students the flexibility to learn on their own through the use of 
highly structured reading materials, computer-based materials, and 
diagnostic practice examinations. The PBL pathway, established in 
1991, emphasizes student-centered, self-directed learning. Unlike 
IS students, PBL students are introduced to basic science concepts 
through the analysis and discussion of clinical cases during small- 
group meetings. Students then work independently on learning is- 
sues that are defined by the group before coming back together to 
discuss their studies. 

Literature Review 

Like any educational innovation, both IS and PBL programs have 
had to prove their effectiveness as alternatives to the traditional 
lecture-based teaching. Lecture-based teaching has existed primarily
for its efficiency, not necessarily for its effectiveness. 

As medical schools struggled to develop alternatives to lectures, 
investigations comparing alternatives to traditional lecture curric- 
ula such as IS and PBL were reported in the literature. Such 
investigations have generally found little or no difference in 
examination scores or clinical performances when comparing lecture-based
courses with alternatives. Way et al. compared alternative
curricular approaches in one college and confirmed that no
difference in average USMLE Step 1 scores existed across alternative
basic science pathways when controlling for pre-matriculation
differences.



The literature on IS in the health professions reveals the following:

1. There is little or no significant difference in learner performances
as measured by examinations and patient care compared
with traditional lecture-based curricula.

2. IS offers both faculty and students more flexibility and portability
in learning when compared with lecture-based learning.

3. IS promotes lifelong, independent learning, self-pacing, and
self-responsibility in learning.

4. Students who participate in IS tend to pursue more research
and full-time faculty positions than students in lecture programs.

5. After start-up costs are accounted for, IS costs the same as or
less than traditional lecture-based courses.

The literature on PBL in the health professions reveals the fol- 
lowing: 

1. There is little or no significant difference in learner perfor- 
mances as measured by examinations or patient care compared with 
traditional lecture-based curricula.**''*'*’ 

2. Differences that have been reported generally indicate the 
same or less factual knowledge but better clinical performance and 
patient management for PBL students.*^"''' *'* 

3. Both faculty and students find PBL more enjoyable and prefer 
PBL to “traditional” lecture courses. *’‘*‘*'**'' *** *’’ 

4. PBL students tend to use “backward” reasoning (working from 
clinical information back to theory) when solving clinical prob- 
lems, whereas traditional students reason “forward” (from theory to 
clinical practice).”'*^''* 

5. PBL students have a greater tendency to use evidence-based 
medicine practices (more journals and literature searches) than 
“traditional” students.*'*'*^ 

Method 

This article reports part of a larger, more comprehensive institu- 
tional research project conducted by a task force of clinical and 
basic science faculty supported by consultants from the College of 
Medicine’s Office of Academic Services (OAS) for Medical Edu- 
cation. Both qualitative and quantitative data were gathered for 
this report using a variety of methods: document analysis, survey 
methods, and interviews with key educational staff members. 

Annual reports dating back to 1991 from each of the three path- 
ways were reviewed and summarized by task force members. Surveys 
for both students and faculty were developed, pilot tested, and sum- 
marized by task force members with help from OAS consultants. 
Surveys were administered in spring quarter of 1997 to all students. 
First- and second-year students were surveyed in their respective 
class locations, as a group; third- and fourth-year students received 
paper copies in their college mailboxes. Return rates were much 
lower for clinical-year students due to clinical assignments and time 
of the survey. Faculty surveys were distributed through internal mail 
services to faculty with 50-100% academic appointments. Likert- 
type survey items were analyzed using descriptive statistics: fre- 
quencies, percentages, cross-tabulations, means, and standard de- 
viations. For reporting purposes “very satisfied” and “satisfied” were 
combined into “satisfied,” and “very unsatisfied” and “unsatisfied” 
were combined into “unsatisfied.” Documents, interview notes, and 
other qualitative data were analyzed using domain analysis of key 
words and phrases.’® 

Results 

Academic Outcomes. No difference across pathways was observed 
for graduation rates or grades on clinical rotations, but more IS 
students were in Alpha Omega Alpha (24% IS, 17% LD, 14% 
PBL) and higher percentages of both IS and PBL students than of 
LD students received departmental awards. 

Student Survey. The student survey was designed to learn how 
students choose their pathways and to assess their satisfaction with 
their choices. The students were also asked to comment on their 
impressions of all three pathways. 

Of the 839 student surveys distributed, 467 usable responses were 
returned (55.6%). The return rate was biased toward the basic sci- 
ence classes (year one = 92%, year two = 76%, year three = 43%, 
and year four = 11%). Return rates by basic science pathway for 
each class surveyed resembled the proportion of students enrolled 
across pathways (LD = 69%, PBL = 17%, and IS = 12%). Because 
so few fourth-year students returned the survey, their data were not 
used. 

Having a choice of pathways was a significant factor in the stu- 
dents’ decisions to come to the college: 56% of the respondents 
agreed that choice of basic science pathway influenced their deci- 
sions to attend the school. 

Based on the students’ responses, the factors that contributed to 
a student’s choice of pathway were learning style, experience with 
nontraditional learning methods, personal and family needs, and 
needs for socialization. Sixty-two percent indicated that the LD 
pathway was their first choice. Many students stated a preference 
for it because it is a method with which they were familiar. Some 
felt that because of perceived weaknesses in their basic science 
backgrounds they needed the structure provided by LD. Social fac- 
tors that contributed to pathway choice were distance from campus, 
need for contact with students and teachers, and need to make 
friends and network. 

The PBL is the only pathway that caps enrollment at 35. 
Twenty-eight percent of the survey respondents (131 students) 
identified the PBL as their first choice of pathway; of these, nearly 
40% (52 students) matriculated into other pathways. Students stat- 
ing preferences for the PBL said that they either had had experi- 
ence with group work in the past or believed that through PBL 
they could learn clinical reasoning skills early. 
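
The percentages in this paragraph can be cross-checked against the respondent counts. The following is a minimal Python sketch, assuming that the 467 usable responses reported earlier are the denominator for the first-choice percentage; the variable names are illustrative and do not come from the study.

```python
# Cross-check of the PBL first-choice figures.
# Counts are taken from the text; variable names are illustrative only.
usable_responses = 467    # usable student surveys returned
pbl_first_choice = 131    # respondents naming PBL as their first choice
pbl_elsewhere = 52        # of those, respondents who matriculated into another pathway

share_first_choice = pbl_first_choice / usable_responses
share_elsewhere = pbl_elsewhere / pbl_first_choice
pbl_first_choice_enrolled = pbl_first_choice - pbl_elsewhere

print(f"PBL named as first choice: {share_first_choice:.1%}")
print(f"First-choice PBL respondents placed elsewhere: {share_elsewhere:.1%}")
print(f"First-choice PBL respondents enrolled in PBL: {pbl_first_choice_enrolled}")
```

Under that assumption the counts are consistent with the rounded figures in the text (28% and "nearly 40%").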

Nine percent of the respondents identified IS as their first choice. 
However, 12% reported participating in the IS pathway. Some stu- 
dents from the PBL wait list had chosen the IS pathway once it 
was determined that they would not be admitted into the PBL 
pathway. The students who chose IS as their first choice cited the 
flexibility of the pathway as their primary reason. This pathway 
tends to attract more nontraditional students such as older students 
with families, married students, or students interested in the MD- 
PhD program. Many stated that they would not have been able to 
complete medical school without the flexibility offered by the path- 
way. Others appreciated the opportunity to manage their own time 
by either accelerating or decelerating their pace through the basic 
sciences. 

Overall, student satisfaction with their basic science pathways 
was high: almost 82% were satisfied with their pathways; only 9% 
reported being unsatisfied. Across the three pathways, PBL students 
reported being the most satisfied (91%), and 93% of the PBL stu- 
dents would have chosen it again. The IS students were almost as 
satisfied with their pathway, with 86% reporting satisfaction, al- 
though only 76% said that they would choose it again. The LD 
students were the least satisfied, with 79% stating that they were 
satisfied and only 63% saying that they would choose that pathway 
again. No difference across cohorts was observed. The proportion 
of students expressing a preference for a given pathway was the 
same for each class: 42% said that they would pick LD, 41% said 
they would pick PBL, and 17% said they would pick IS. 

Overall, 52% of the students felt they had missed something 
offered by the other pathways (54% of LD, 41% of PBL, 51% of 
IS students). Many LD students felt that they missed the clinical 
experience, case studies, and active learning that were offered by the 
PBL. On their own initiative, non-PBL students have started a 
case-study interest group in an effort to make up for this perceived 
need. Alternative-pathway students felt that they missed out on 
well-presented and organized material from content experts, com- 
prehensive coverage, pressure to perform, and proper pronunciation 
of medical terms. 

The overwhelming response by students was that choice was very 
important and that students have different learning styles. They 
felt that choice attracts a higher caliber of students and shows that 
the school is a progressive medical school. Over 90% of the re- 
spondents agreed that the school should continue to offer multiple 
basic science pathways. 

Faculty Survey. All 568 faculty with 50% or greater appoint- 
ments were surveyed; 133 (23.4%) responded. Of the 133 respon- 
dents, 23% were from basic science departments, 48% from clinical 
sciences, and 29% did not provide their departments. 

Nineteen percent of the respondents reported no teaching ex- 
perience in any pathway. Sixty percent taught in only one pathway 
(LD 50%, IS 1.5%, PBL 7.5%). Fourteen percent of the respon- 
dents reported experience in two pathways (LD/IS 4.5%, LD/PBL 
7.5%, IS/PBL 2.3%). Seven percent participated in all three path- 
ways. 

The faculty respondents were generally satisfied with their stu- 
dent interactions in each pathway (54% of LD, 53% of IS, and 
87% of PBL faculty). The basic science and clinical faculty dis- 
agreed on the appropriateness of the distribution of their teaching, 
research, and service time: 80% of the basic science faculty were 
satisfied with the time distribution, while only 47% of clinical fac- 
ulty were satisfied. 

When asked, “in your opinion is it important that the College 
of Medicine and Public Health continue to offer three preclinical 
pathways?” the faculty responses of those who expressed an opinion 
were split almost evenly (38% yes, 39% no, and 22% no opinion). 
For the faculty who identified their departments, approximately half 
replied in the affirmative (47% of basic science faculty, 50% of 
clinical faculty), and 19% had no opinion. 

Discussion 

Based on student and faculty opinions from surveys and comparison 
of pathway outcomes for 1993 to 1997, the task force unanimously 
recommended that the college maintain three basic science path- 
ways. The presence of three preclinical pathways provides the col- 
lege tremendous flexibility to accommodate student learning styles 
and time requirements. Students highly value the commitment of 
the college to medical education by accommodating their different 
learning styles. Providing three pathways is also an impor- 
tant factor in the recruitment and admission of high-quality stu- 
dents. Differences in outcome measures are small and may be at- 
tributed to higher pre-matriculation statistics for IS and PBL 
students. 

The three basic science pathways are important in maintaining 
the positive image of medical education at the college. This is true 
both for current medical students and for those applying. Requests 
for the PBL pathway from entering students averaged 46% of the 
entering classes of 1994-1997, and IS enrollments have increased 
dramatically. Faculty are generally satisfied with student interac- 
tions in the LD and IS pathways, but are most satisfied with their 
interactions with the PBL students. 

Three pathways provide for differences in learning styles, as well 
as offering time for independent learning, research, and outside 
interests. Time flexibility by pathway is greatest with IS, followed 
by PBL, and least with LD. Student satisfaction with their current 
pathways is very high: 91% of PBL, 86% of IS, and 79% of LD 
students were satisfied with their basic science pathways. In spite 
of the high satisfaction levels, however, approximately half of the 
students felt that they had missed something in their pathways that 
was available in another pathway. Student comments indicated that 
this lack was not one of content material but rather in the social 
and pedagogic opportunities with faculty and other students. Eighty- 
seven percent of the students agreed that the college should 
continue to offer three basic science pathways; only 5% disagreed. 

Low faculty response rates, lack of teaching experience in the 
pathways, and “no opinion” responses make it difficult to interpret 
the faculty survey data. Therefore, the task force recommended ed- 
ucating faculty about the importance of the three pathways and 
their recruiting and retention benefits. 

Conclusions 

The Ohio State University College of Medicine and Public Health 
is well served by offering three parallel but alternative basic science 
curricula and will continue to do so. The large entering class size 
(210) and the three different pathways make the college fertile 
ground for comparison of alternative curricula. This study confirms 
previous conclusions in the literature about independent study and 
problem-based medical education in terms of outcomes, flexibility, 
choice, and student and faculty satisfaction and preferences. In ad- 
dition, it established that multiple curricula are important factors 
in admissions, educational reputation, and accommodating various 
student learning styles. 

• PREPARING TO MAKE THE GRADE 



Moderator: Lynn Epstein, MD 



The Health Sciences and Technology Academy: Utilizing Pre-college Enrichment Programming to 
Minimize Post-secondary Education Barriers for Underserved Youth 

SHERRON BENSON MCKENDALL, PRISCAH SIMOYI, ANN L. CHESTER, and JAMES A. RYE 



West Virginia is considered one of the most rural states in the 
nation, with over 60% of its population classified as rural.' The 
state experiences relatively high unemployment, and it ranks 
among the lowest (49th) of all states in median household income.’ 
Fifty-eight percent of the students in West Virginia counties are 
eligible for free or reduced-price lunch.^ Furthermore, only 14.7% 
of adult residents 25 years and over have attained a bachelor’s de- 
gree or higher,’’ putting the state 50th in higher education. 

The rural nature of the state coupled with economically de- 
pressed communities has limited the availability of secondary-level 
science courses required for health sciences majors in college. Ad- 
ditionally, most counties in West Virginia are considered medically 
underserved, and therefore it is important to increase the number 
of health care providers in rural areas of the state.^ However, if the 
state’s under-represented students do not receive adequate prepa- 
ration in pre-college math and science, the proportion who can 
attend college and succeed will continue to be limited,’’ and the 
pool for the health professions will be too small. 

To overcome some of these barriers, West Virginia University 
and 21 West Virginia counties have come together in the Health 
Sciences and Technology Academy (HSTA) in a community-cam- 
pus partnership. Its web site is (http://www.wv-hsta.org). A pre- 
college enrichment program,* HSTA helps students learn tools to 
enable them to progress through high school, college, and profes- 
sional school. The HSTA program consists of an on-campus 
Summer Institute at West Virginia University (WVU) where stu- 
dents and science teachers are engaged in learning activities facil- 
itated by science and education faculty. These science teachers also 
facilitate HSTA community-based science clubs during the school 
year. The HSTA model uses the inquiry-based theory that encom- 
passes problem posing, problem solving, and persuasion.^'** Research 
suggests that inquiry activities emphasizing problem solving en- 
hance middle-level students’ self-confidence in mastering science 
and their attitudes towards the discipline.'* Furthermore, inquiry- 
based learning is considered fundamental to students’ understand- 
ing of science concepts and processes. The National Science Edu- 
cation Standards (NSES) call for greater emphasis on “inquiry into 
authentic questions generated from student experiences [which] is 
the central strategy for teaching science.” As a follow up to the 
NSES, a practical guide has been developed for educators who wish 
to emphasize inquiry-based instruction. A principal thrust within 
the community science clubs is inquiry-based learning of science 
through extended investigations and community service projects.'^ 
The model also engages students in authentic learning processes 
(i.e., real-world problem-solving circumstances), which are both 
fun and challenging. Students’ projects often target health-re- 
lated topics and may potentially inform and benefit various com- 
munities through dissemination at local and state levels.” 

Methods 

HSTA’s effect on the academic success of its graduates and their 
decisions to pursue post-secondary studies and/or health sciences 
majors was assessed using quantitative and qualitative methods. 
HSTA participants are selected based on at least two of the follow- 
ing criteria: African American, financially disadvantaged, rural, and 
first generation aiming for higher education. Participants are ad- 
mitted to HSTA during the ninth grade and participate in various 
activities until they graduate from high school, at which time they 
are considered HSTA graduates. There are 35 and 61 HSTA grad- 
uates for the 1998 and 1999 academic terms, respectively. 

Telephone interviews were conducted in the fall and spring of 
the 1999-2000 academic term. Graduates were asked a series of 
questions that employed a Likert-type scale regarding HSTA’s im- 
pact on pursuit of post-secondary study (1 = no impact to 5 = very 
high impact), choosing a health sciences major (1 = no impact to 
5 = very high impact), preparation for college (1 = not at all pre- 
pared to 3 = extremely prepared), and preparation for major (1 = 
not at all prepared to 3 = extremely prepared). The participants 
were also asked to briefly explain why they had rated the program’s 
impact and preparation levels as such. 

In the fall of 1999, HSTA’s impact on graduates’ college per- 
formances was assessed with an independent t-test comparing the 
mean (μ) grade-point average (GPA) of 40 HSTA students (ex- 
perimental) with that of 120 non-HSTA students at West Virginia 
University (WVU) in Morgantown, West Virginia. The 120 non- 
HSTA students were randomly selected from those enrolled at 
WVU with the same status (e.g. freshman), declared major, and 
residency. In order to achieve an effective sample size based on 
these characteristics (i.e., status, major, WV residency), three con- 
trols were matched to each experimental case. 

Results and Discussion 

Interviews. The return rates for questionnaires on pursuing post- 
secondary study and preparation level were 97% (93 students) and 
80% (77) for the combined cohorts (1998 and 1999). 

Post-secondary Study. The graduates’ responses about HSTA’s im- 
pact on their decisions to pursue post-secondary study indicate a 
strong impact, 3.88 and 3.96 (5 = very high) for the 1998 and 1999 
graduates, respectively. The graduates provided a variety of reasons. 
One stated, “Before HSTA I didn’t even know I could go to college 
because I’m from a poor family, and they gave me the chance to 
go to college.” Another graduate affirmed that HSTA was the rea- 
son for her being in college. She posited that 

First of all, I didn’t think I would be able to go to college and I really 
— I didn’t have anybody in my family that said okay here is where 
I’m going, you ought to check this stuff out. When I came up here, I 
fell in love with the [WVU] campus .... And I got in and it’s nice 
to have contacts .... I knew a lot of the teachers and a lot of the 
faculty through HSTA and it really helped me out a lot. 

A graduate who intended to pursue a nursing career reported 
that “when I was in high school and worked with HSTA for the 
summer, we got to work with the cadavers. That’s hands-on expe- 
rience that I’d never have had.” Essentially, HSTA provides stu- 
dents with tangible experiences that bring excitement to the learn- 
ing process. Not only is the program a pipeline for participants who 
wish to pursue post-secondary study, it also provides financial sup- 
port for students who would not have had the opportunity to attend 
college. 

College Major. Approximately 66% (23) of the 1998 graduates 
and 80% (49) of the 1999 graduates chose health sciences majors. 
Among these graduates, the impact of HSTA on this decision 
ranged from moderate to high, 3.60 (1998 cohort) and 3.74 (1999 
cohort) (5 = very high). The graduates rated the program highly 
because of the hands-on learning experiences it had afforded them. 
For example, one graduate stated 

Whenever we ... would do hands on experiences, it just made me 
more interested, especially in Psychology because when we go to mess 
with the brains ... it just made me more interested. It made it not 
seem as hard or as bad as what people think it is. 

Another graduate reported that, “HSTA allowed me to see a lot of 
different areas in the health field that I wouldn’t have seen oth- 
erwise. It kind of gave me a taste of everything and just sort of 
oriented me.” Overall, these experiences not only expose students 
to various occupations of which they would otherwise have no 
knowledge, but also provide the opportunity to explore horizons 
within the realm of health sciences. 

Figure 1. The Health Sciences and Technology Academy (HSTA) graduates’ 
GPAs, their retention rates at West Virginia University compared with a non- 
HSTA group at the university, and their choices of major in higher education, fall 
and spring 1999-2000. 

College Preparation. The graduates responded positively regarding 
level of preparation for college and major as a result of their par- 
ticipation in HSTA. Of both cohorts, 98% (94) were pursuing post- 
secondary study. In response to questions about college preparation, 
the mean responses were 2.45 and 2.46 (3 = extremely prepared) 
for the 1998 and 1999 graduates, respectively. 

The graduates rated the program’s preparation for their majors as 
1.95 (1998 cohort) and 2.27 (1999 cohort) (3 = extremely pre- 
pared). Thus, the overall perception is that HSTA prepared them 
at least moderately for their majors. The higher rating given by the 
1999 graduates may be due to the higher percentage of them who 
intended to pursue health sciences majors. 

College Performance. The HSTA graduates had a significantly 
higher undergraduate GPA (population mean [μ] undergraduate 
GPA of 3.00) than the non-HSTA control group’s mean GPA of 
2.51. An independent t-test comparing the mean GPA of HSTA 
graduates with that of non-HSTA students at West Virginia Uni- 
versity showed a statistically significant difference be- 
tween the GPAs (α = .05, p = .0014, t = 3.2495). The result 
exemplifies HSTA’s impact on those who matriculate to and grad- 
uate from the program. After the t-test was performed, a 99% con- 
fidence interval of the true μ of the non-HSTA population was 
determined. The true μ of the non-HSTA population is between 
2.31 and 2.71. Thus, we are 99% confident that the true mean 
GPA for the non-HSTA student population lies in the interval 
[2.31, 2.71]. Therefore, the true μ GPA (3.00) of HSTA students 
who attend WVU is not only higher than that of the control group 
(2.51) but also higher than that of the total non-HSTA population. 

Retention. All of HSTA’s graduates who enrolled at WVU during 
the fall of 1998 were retained, compared with a rate of 78% for 
non-HSTA first-time freshmen (see Figure 1). Furthermore, an 
overwhelming 74% of 1998 and 1999 HSTA graduates are pursuing 
health sciences majors, compared with 26% of the graduates who 
have chosen other fields of study (see Figure 1). The graduates 
majoring in health sciences are particularly drawn to fields such as 
biology/chemistry, nursing, psychology, and allied health (see Fig- 
ure 2). 



Conclusion and Implications 

The Health Sciences and Technology Academy provides a pipeline 
for underrepresented youth to pursue their higher education goals. 
Through pre-college enrichment measures, HSTA gives students 
multifaceted opportunities for academic enrichment, which help 
them to realize that they can become accomplished individuals in 
their communities and in society at large. Pre-college programs such 
as HSTA can provide enriching experiences for underprivileged 
students who may not foresee the importance of completing high 
school and going to college. 

Although HSTA has not provided its graduates with enrichment 
experiences beyond high school, it can be assumed that the success 
of these students at WVU can be attributed, in part, to the pre- 
college enrichment provided by HSTA. Many of the graduates at 
WVU and other higher education institutions express a deep sense 
of fulfillment as a result of their participation in HSTA. Further- 
more, many have expressed that their desires to pursue health sci- 
ences as well as technologic careers are due, in part, to the HSTA 
program. Their performance relative to that of non-HSTA students 
with similar interests is extremely encouraging. We believe that the 
HSTA model provides an exciting opportunity to extend inquiry- 
based learning, via longitudinal science projects, beyond what oth- 
erwise would be possible in the science classroom. All evidence 
indicates that the long-term benefits of this pre-college enrichment 
program will be positive. 

Correspondence: Sherron Benson McKendall, PhD, Health Sciences and Technology 
Academy, PO Box 9026, Robert C. Byrd Health Sciences Center, Morgantown, WV 
26506. Reprints are not available. 



The Mount Sinai Humanities and Medicine Program: An Alternative Pathway to Medical School 

MARY R. RIFKIN, KENNETH D. SMITH, BARRY D. STIMMEL, ALEX STAGNARO-GREEN, 

and NATHAN G. KASE 



In 1984 the AAMC report of the Panel on the General Profes- 
sional Education of the Physician' recommended that students pre- 
paring for medical school should strive for a curriculum that pro- 
vides a broad study in both the sciences and the humanities and 
that required courses should be kept to a minimum. One way to 
encourage premedical students to follow a truly broad liberal arts 
education would be to accept students to medical school early in 
their college careers, thereby alleviating the pressure to focus ex- 
cessively on the traditional science-based curriculum. Because there 
is no evidence to suggest that science majors are necessarily more 
qualified for medical school, we initiated an experimental program 
that encouraged humanities and social science majors to pursue 
their individual interests in college and to obtain a broad, maturing, 
liberal arts education. Such students might be expected to be less 
focused on the technology of medicine, bring different perspectives 
to the practice of medicine, and simultaneously diversify the stu- 
dent body. 

In 1989 the Mount Sinai School of Medicine (MSSM) started 
the Humanities and Medicine (H&M) Program, an early-assur- 
ance-of-admission program designed for humanities and social sci- 
ence majors at a targeted group of five liberal arts colleges and 
universities (Amherst, Brandeis, Princeton, Wesleyan, and Wil- 
liams).' Students in this program are selected during the first se- 
mester of their sophomore year in college. Admission into the pro- 
gram is based on a written application with personal essays, verbal 
and math SAT scores, high school and college transcripts, letters 
of recommendation, and personal interviews. The students are re- 
quired to major in the humanities or social sciences and are re- 
quired to complete only one year of college biology and one year 
of college chemistry with a grade of B or better. 

Admission to MSSM is contingent upon successful completion 
of undergraduate studies, provided the GPA does not drop below a 
minimum of 3.0. MCAT scores are not required. In addition, stu- 
dents are required to spend an eight-week summer term at Mount 
Sinai after their junior year, during which they are exposed to clin- 
ical activities and complete a much abbreviated course on the prin- 
ciples of organic chemistry and physics relevant to medicine. Hous- 
ing and a stipend are provided. Students admitted to the H&M 
program are under no obligation to attend Mount Sinai should 
their career choices change or another medical school appear more 
attractive. Also, the students have the option of deferring their 
admission to medical school for one year after obtaining the un- 
dergraduate degree. 

This study reports the outcomes of ten years’ experience with the 
H&M Program. Our experience shows that although students in 
this program have more academic difficulties in the preclinical 
years, they excel in the clinical/community setting and have greatly 
enriched the medical school environment. This program demon- 
strates that success in medical school does not depend on a tradi- 
tional premed science curriculum. 



Method 



The achievements of all H&M students (n = 85) matriculating at 
MSSM between 1991 and 1998 have been compared with those of 
two matched cohorts of students who had been accepted through 
the standard admission process and had completed all standard pre- 
med science requirements. Students in each cohort were matched 
to the H&M students on the basis of year of matriculation, gender, 
age (within three years), category of educational institution (top 
30 liberal arts colleges or universities, taken from the 1998 US 
News & World Report Survey), and, when possible, ethnicity, and 
were either humanities/social science majors or science majors. The 
groups of 85 students included students at different stages of their 
medical school careers and five classes of graduates (1995-1999). 

For each group, academic performance in medical school in both 
basic science courses and clinical clerkships and performance on 
the USMLE Step 1 examination were analyzed. In addition to these 
quantitative indicators of performance, we performed an analysis of 
the students* overall medical school achievements and contribu- 
tions to the medical school environment in terms of extracurricular 
activities, student leadership, and sers'ice, by evaluating their elec- 
tion to AOA and receipt of special awards. P values were deter- 
mined using the x’ test. 
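
As an illustration of the χ² test named above, the sketch below applies a Pearson chi-squared test to the year-one course failure counts reported in Table 1 (20 of 85, 11 of 85, and 2 of 85). It assumes a single three-group by two-outcome comparison with no continuity correction; the paper does not say exactly how each test was set up, so this is one plausible reading, and the function name is illustrative. It relies on the fact that, with 2 degrees of freedom, the chi-squared upper-tail probability has the closed form exp(-x/2).

```python
import math


def chi_squared_kx2(failures, totals):
    """Pearson chi-squared statistic for a k x 2 table of failure / no-failure counts."""
    all_fail = sum(failures)
    all_n = sum(totals)
    stat = 0.0
    for fail, n in zip(failures, totals):
        exp_fail = n * all_fail / all_n    # expected failures if groups did not differ
        exp_pass = n - exp_fail            # expected non-failures
        stat += (fail - exp_fail) ** 2 / exp_fail
        stat += ((n - fail) - exp_pass) ** 2 / exp_pass
    return stat


# Year-one course failures from Table 1:
# H&M students, matched humanities majors, matched science majors.
failures = [20, 11, 2]
totals = [85, 85, 85]

stat = chi_squared_kx2(failures, totals)
# Three groups and two outcomes give (3 - 1) * (2 - 1) = 2 degrees of freedom;
# for df = 2 the upper-tail probability is exactly exp(-x / 2).
p_value = math.exp(-stat / 2)
print(f"chi-squared = {stat:.2f}, p = {p_value:.5f}")
```

Under these assumptions the p-value falls below .001, in line with the value reported for that row of Table 1.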



Results 



The undergraduate science/math background of students entering 
MSSM through the H&M program consists of one year each of 
biology and chemistry and a short summer course at MSSM, “Phys- 
ics and Organic Chemistry Relevant to Medicine.” This differs from 
the premed science/math requirements for all other students ma- 
triculating at MSSM, namely one year each of biology, chemistry, 
organic chemistry, physics, and math. The data in Table 1 show 
that a significantly higher proportion of H&M students had at least 
one course failure in the basic science years than did the students 
with traditional premed science backgrounds, who were either hu- 
manities majors or science majors. Over 75% of the course failures 
of H&M students occurred in the first semester of year one, where 
there were nine failures in biochemistry, six in embryology, six in 
cell biology, and five in gross anatomy (data not shown). Among 
the 20 H&M students who failed one or more courses, nine stu- 
dents failed multiple courses, with the range being up to four 
courses. In the second basic science year, the proportion of H&M 
students with at least one course failure decreased, with no single 
course having a disproportionate number of failures. 

Compared with their classmates, the H&M students had a higher 
failure rate on the USMLE Step 1 examination (Table 1), although 
all these students eventually passed it (data not shown). In an at- 
tempt to determine whether failure on the Step 1 examination 
could be predicted from data available at the time of acceptance 
into the H&M program, we analyzed the correlation of these stu- 
dents’ SAT scores with their performances on the Step 1 exami- 
nation. Neither Verbal SAT (R² = 0.08) nor Math SAT (R² = 0.07) 
scores correlated with the Step 1 examination score. However, all 
students who failed the Step 1 examination had Verbal SAT scores 
< 650. 

In the clinical years of medical school, statistically significant 
differences in performance between the H&M students, when com- 
pared with the matched cohorts, were less evident. The failure rate 
of H&M students in clinical clerkships (Table 1), the garnering of 
clerkship honors, and election to AOA (Table 2) were not signif- 
icantly different from those of either matched humanities majors 
or matched science majors. In fact, the H&M students with mul- 
tiple clerkship honors were often the same students who had had 
academic difficulty in the basic science years or who had failed the 
Step 1 examination. Analysis of specific clerkships indicated that 
the H&M students excelled in the psychiatry and pediatrics clerk- 
ships (data not shown). 

Table 1. Performance of Humanities and Medicine Students Compared with Two Matched Cohorts in Preclinical Courses, in Clinical Clerkships, 
and on the USMLE Step 1 Examination. Mount Sinai School of Medicine, 1991-1998 

                                                                    Humanities and      Matched Regular     Matched Regular
                                                                    Medicine Students   Premed Students,    Premed Students,
                                                                                        Humanities Majors   Science Majors      P

Basic science year one: students with at least one course failure   20 (85)*            11 (85)             2 (85)              <.001
Basic science year two: students with at least one course failure   10 (76)             3 (77)              3 (77)              <.03
Clinical clerkships: students with at least one clerkship failure   6 (76)              2 (77)              1 (77)              <.09
USMLE Step 1: students failing on first try                         10 (76)             2 (77)              3 (77)              <.02

* Number in parentheses indicates total number of students analyzed. 
P values were determined by chi-squared analysis of the data. 

In the preclinical years, Book Awards are given to those students 
who have performed outstanding extracurricular service within the 
community or who have contributed time and energy in service to 
the institution. Over half the Book Awards were awarded to H&M 
students (Table 2). H&M students are also disproportionately 
represented on various subcommittees of the Student Council and 
other institutional committees, and serve in large numbers 
as student group representatives to national organizations such as 
the American Medical Student Association, the American Medical 
Women's Association (AMWA), and Students for Equal 
Opportunity in Medicine (SEOM). Furthermore, a greater proportion of 
H&M students than students in the two matched cohorts received 
prizes and awards at graduation (Table 2). 

Additional data, not shown, indicate that the H&M students 
completed medical school at the same rate and did not have a 
higher attrition rate than students entering medical school with 
more traditional premed backgrounds. Analysis of residency 
placements indicated that 77% of the H&M students placed in 
university hospital-based programs, as opposed to affiliate hospital-based 
programs, as did 74% of the science majors cohort and 69% of the 
humanities majors. 



Discussion 

The Humanities and Medicine (H&M) Program challenges the 
long-standing belief that there is a necessary relationship between 
undergraduate science preparation and the successful completion of 
medical school and physician excellence. Students in this program 
are encouraged to use their time in college to pursue in depth their 
individual interests in their particular majors, which must be in the 
humanities or social sciences. They often spend considerable time 
in study abroad, independent research projects in their major fields, 
or extracurricular activities on campus, such as creative or perform- 
ing arts or journalism. These students thereby avoid premature spe- 
cialization and can obtain a broad, maturing, liberal arts education. 

The academic performance of H&M students at MSSM has been 
compared with the performances of two matched cohorts: 
matriculated students with the standard, required science course 
background who majored either in the humanities/social sciences or in 
science. Since the medical school basic science courses are all 
graded by a norm-referenced rather than criterion-referenced 
system, and all the other students had had at least two more years of 
science, including organic chemistry, it is not surprising that the 
H&M students had more academic difficulties in the preclinical 
years than did the traditional premed students. However, in the 
clinical years and in the community setting, the H&M students 
were similar to the traditional premed students in garnering 
clerkship honors, institutional awards and prizes, and election to AOA. 



Table 2. Numbers of Honors and Awards Given to Humanities and Medicine Students and to Two Matched Cohorts, 
Mount Sinai School of Medicine, 1991-1993 

                                                  Humanities and    Matched Regular      Matched Regular
                                                  Medicine          Premed Students,     Premed Students,
                                                  Students          Humanities Majors    Science Majors       p†
Clerkship honors/students
  0 honors grade                                  7                 11                   16                   .12
  1-5 honors grades                               57                51                   44                   .06
  6-10 honors grades                              12                15                   15                   .77
Alpha Omega Alpha                                 14 (76)*          9 (77)               15 (77)              .08
Book awards
  First year                                      5 (85)            2 (85)               1 (85)               .21
  Second year                                     9 (77)            2 (77)               4 (77)               .06
Graduation awards/prizes, classes of 1995-1999    21 (61)           10 (61)              13 (61)              .03

* Number in parentheses is total number of students. 
† p values were determined by chi-squared analysis of the data. 

All the H&M students who failed clinical clerkships (n = 6) also 
had course failures in both of the basic science years, whereas none 
of the students in the cohort groups (n = 3) who failed clinical 
clerkships had course failures in both of the first two years of 
medical school. While the numbers are small, these data, together with 
other information about these students' career goals and 
motivation, suggest that this subset of H&M students may represent 
students not wholly committed to the study of medicine. There was 
no evidence in the undergraduate records of these students that 
could have predicted this pattern of failure. 

Although previous reports by others indicate that there is no 
significant correlation between medical school performance and 
undergraduate major, the students in those studies had completed the 
required science courses of a traditional premedical undergraduate 
education. Our report on the performance of the H&M students, 
who have majored in the humanities or social sciences and who 
have had minimal science education in college, indicates that, as 
might be expected, these students have significantly more academic 
difficulty in the basic science years in medical school than matched 
classmates who have completed the traditional premedical 
curriculum. Moreover, we found that all H&M students who failed the 
USMLE Step 1 exam had verbal SAT scores equal to or less than 
650. Thus, in an effort to minimize the number of students whom 
we might predict would have difficulty in medical school, we have 
decided to pay particular attention to the verbal SAT score in our 
admission process, as well as to scrutinize applicants' high school 
science and mathematics achievements with care. 

The premise on which the H&M Program is based is that by 
eliminating the traditional premed requirements in 
college, students have more time to devote to their humanities 
majors and other pursuits and thus have time to broaden their 
backgrounds, which would be beneficial to their careers as 
physicians. These students bring to the medical school certain qualities 
and outlooks that positively affect the entire medical school 
community. They have been among the founders of various musical 
ensembles, theater groups, and art exhibitions, as well as members 
of the executive boards of the MSSM chapters of AMWA and 
SEOM. The first woman president of the Student Council was an 
H&M student. There is no doubt that the MSSM community has 
been enriched by the diversity of interests brought to the campus 
by the H&M students. 

The studies reported here should lead us to reconsider the need 
for the traditional science courses as a prerequisite for success in 
medical school. Numerous published reports have questioned the 
emphasis on science knowledge in the selection of medical students 
and have suggested that studies in the humanities may enhance 
effective patient interaction and communication. By selecting 
highly qualified, intelligent students early in their college careers 
and allowing them to develop their curiosity in their chosen fields 
of interest, as well as to involve themselves in community and 
extracurricular affairs, we have shown that such students successfully 
complete medical school and excel in clinical activities. We intend 
to track these students as they complete their residencies and es- 
tablish their careers to be able to more fully evaluate their contri- 
butions. 

Correspondence and requests for reprints: Mary R. Rifkin, PhD, Mount Sinai School 
of Medicine, Box 1475, New York, NY 10029. 

1999 JACK MAATSCH MEMORIAL PRESENTATION 

The Epistemology of Clinical Reasoning: 
Perspectives from Philosophy, Psychology, and Neuroscience 

GEOFFREY R. NORMAN 



Physicians' clinical reasoning has been an active area of research 
for about 30 years. The goal of the inquiry has been to reveal the 
processes whereby doctors arrive at diagnoses and management 
plans (although, as Elstein correctly points out in his discussion of 
this paper, the focus has been more on the former than on the 
latter) so that we could use this information to devise specific 
instructional strategies or support systems to make the acquisition and 
application of these skills more efficient and effective. Initially, 
these “clinical reasoning skills” were conceived as general and 
content-independent, so that they could be observed in all clinicians 
working through any problems. That is, they were thought of as a 
general mental faculty, presumably rooted in the architecture of the 
mind, which would be brought to bear on solving clinical problems. 

However, the research findings did not support this viewpoint. 
Elstein and Shulman showed that whatever clinical reasoning was, 
it was definitely not skill-like, in that there was consistently poor 
generalization from one problem to another, a finding that 
ultimately sounded the death knell for evaluation methods such as 
patient management problems. The past 30 years have seen an 
accumulation of evidence, in medicine and many other disciplines, 
about the nature of the process, and have shown the importance and 
centrality of knowledge. The central issue of this revised research 
program is achieving an understanding of how knowledge is 
initially learned, how it is organized in memory, and how it is accessed 
later to solve problems. 

A second research program in medical decision making also 
emerged from research of the early 1970s. As Elstein discusses in 
the companion paper, this program “views diagnosis making as 
opinion revision with imperfect information.” From the 
decision-analytic perspective, the best decisions arise from the application 
of a statistical decision rule to data; any other method is 
suboptimal. Thus, the research agenda is directed to identifying areas such 
as medicine where humans function in a suboptimal way, and 
attempting to understand the strategies, the heuristics and biases, 
they apply to arrive at these suboptimal decisions. 
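
“Opinion revision with imperfect information” is conventionally formalized with Bayes' rule. The sketch below illustrates that reading under stated assumptions: the function name is hypothetical, and the pretest probability, sensitivity, and specificity are invented for illustration and are not drawn from Elstein's paper or from any study.

```python
def revise_probability(prior, sensitivity, specificity):
    """Return P(disease | positive finding) using Bayes' rule."""
    # Probability of a positive finding in a patient who has the disease.
    true_positive = sensitivity * prior
    # Probability of a positive finding in a patient who does not have it.
    false_positive = (1.0 - specificity) * (1.0 - prior)
    return true_positive / (true_positive + false_positive)


# Hypothetical values, for illustration only.
posterior = revise_probability(prior=0.10, sensitivity=0.90, specificity=0.80)
print(f"revised probability of disease: {posterior:.3f}")
```

With these invented inputs, a positive finding raises the probability of disease from 0.10 to about one third. In the decision-analytic view, a clinician whose revised opinion departs from this figure is, by definition, reasoning suboptimally.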

Elstein states that “it seems to me that decision theory is at least 
as promising as the study of categorization processes.” He may well 
be correct. But the two schools highlight a fundamental 
epistemologic dilemma that the remainder of this paper addresses: Will 
we understand more about the nature of clinical diagnosis by 
focusing on the diagnostician and striving to understand the mental 
processes underlying diagnosis, or by focusing on the clinical 
environment and attempting to understand the statistical associations 
among features and diseases? To what extent is the world of clinical 
reasoning “out there” and comprehensible by understanding the 
relation between symptoms and diseases, and to what extent is it 
“inside” and understandable only by examining mental processes 
in detail? 

Further dilemmas face us as we examine the research in clinical 
reasoning. “Organization of knowledge” is viewed as a critical 
determinant of expertise in medicine. But it is not really clear what 
is meant by organization of knowledge. Is knowledge organized 
hierarchically with general concepts at the top, more specific scripts 
in the middle, and specific instances at the bottom? Is it organized 
in networks with nodes and connections, as a symptom-by-disease 
matrix, as propositions with causal links, as collections of 
semantic axes, or as individual examples with no overarching concepts, 
as some of my earlier research claimed? 

A perusal of these various studies leaves the reader with only 
one overall impression: that the human mind is incredibly flexible 
and can organize and reorganize information at will and seemingly 
effortlessly to give the researcher exactly what he or she wants to 
hear. It is no coincidence that propositional networks are 
disturbingly idiosyncratic and not apparently reproducible. My view is 
that all of these concept architectures are produced on the fly at 
retrieval, in order to satisfy the expectations of the researcher, and 
none can claim special status as the way knowledge is organized. 
Do you want the clinician to tell you the probability that 
myocardial infarction (MI) will present with referred pain to the back? 
Can do. The nature of the neural pathways linking the heart and 
the upper arm? Sure. The hair color of the last patient they saw 
with an MI? Red. Given this incredible diversity of knowledge from 
specific to general, it seems likely that any attempt to uncover a 
representation of knowledge consistent with a particular 
perspective from fairly directive probes will be successful; however, the 
ultimate form of this knowledge (if that is even an issue worth 
addressing) will remain elusive. 

Still, if the clinician's mind is really that malleable, then this 
poses a serious challenge to the research tradition. Are there really 
any more “basic” or “primitive” forms of knowledge? How can we 
understand the nature of clinical reasoning if it appears to be this 
flexible? These were the questions that presented themselves as I 
reviewed the studies of clinical reasoning. As I thought about these 
issues, I began to explore other perspectives on the nature of 
knowledge and knowing from philosophy, psychology, and neuroscience, 
and started to identify common threads that, I think, can shed some 
light on these questions. As I did so, I found myself moving back 
and forth among three kinds of knowing, more or less from specific 
to general: 

1. How does the clinician come to know about diseases? How 
might diseases be represented in his or her mind? 

2. How do we as researchers come to understand domains of 
science, whether these are the diseases of clinical research or the 
workings of the clinician’s mind? 

3. What do we mean by knowing? What do we mean when we 
say we understand something? 

In the remainder of this article I roam freely among these levels, 
since many of the writings I uncovered inform all levels. But I must 
begin with a disclaimer. My journeys in this field are as an amateur, 
and are recent. I have been heavily influenced in my interpretations 
by two books. The first is Lessons from an Optical Illusion, by 
Hundert, who took the brave step of trying to find links among 
philosophy, psychology, and neuroscience. His goal was to place ethics 
in a context of these disciplines; mine is to turn these general truths 
to an understanding of clinical reasoning. A second major influence 
on my thinking is a book called What Is This Thing Called Science? 
by Chalmers, a wonderful and readable review of classical 
philosophy and philosophy of science. I highly recommend both. 

The starting point of my discourse is a critical examination of 
the concept of disease. My intention is to use the exploration of 
disease as a case study of how we come to know about things. 

What Is a Disease? 

Through advances in biology, physiology, and molecular biology, 
we have come to a deep understanding of the mechanisms of many 
diseases. It seems almost nonsensical to now turn the clock back 
and ask what a disease is. But this small departure may serve us in 
good stead in understanding better what a concept is and how 
people identify concepts. 

Let's take two examples: 

■ Is syphilis a disease? Absolutely. It fits the medical model to 
perfection. A bacterium invades the host, stimulating a diversity 
of processes that ultimately are manifested in clinical signs. Osler 
said “understand syphilis and you understand all of medicine.” 
But there is a small historical glitch. Syphilis has been with 
mankind for millennia and the signs and symptoms were well 
established long before the bacterium was isolated. 

■ Is heart disease a disease? Yes. Put a label such as anterior 
myocardial infarction on it, and it looks even more like a disease. 
But likely we are all harboring the precursors of ischemic disease 
as cholesterol plaques slowly accrue in our arteries. So in a 
manner of speaking, the prevalence of heart disease approaches 
100%. Can we then still speak of it as a disease? And by the 
way, although there are many risk factors for heart disease, there 
is no clear cause. The same is true for cancer. We can easily 
identify cancerous cells on pathology slides, and we can correlate 
the clinical course with the accumulation of malignant lesions, 
but we all have microscopic tumors in our thyroids, and a third 
of men who die of unrelated causes are found to have prostate 
cancer. 

All of these things seem disease-like because we can “explain” 
them at some lower level: plaques, bacteria, malignant cells. But 
there are many other diseases listed in textbooks that have no clear 
causes, no microscopic correlates, no known mechanisms. And it 
is well to bear in mind that although anthropologists and historians 
have identified evidence of (for example) tuberculosis dating back 
several thousands of years, and although old writings in medicine 
clearly describe the symptoms and clinical course of tuberculosis, 
the cause, the tubercle bacillus, was identified, by Koch, only as 
recently as 1882, and effective therapy has been available only 
since the 1940s. So the existence of a causal mechanism is hardly 
sufficient to claim that something is a disease. More generally, it is 
likely that exceptions to any definition of disease will be common. 

Campbell et al., in a classic article, “The Concept of Disease,” 
reported presenting clinicians and lay people with a series of 
medical conditions and asking them whether or not they were 
diseases. Perhaps not surprisingly, doctors were more prone than lay 
people to call things such as lead poisoning and tennis elbow 
diseases. But there was otherwise quite good concordance. Infectious 
diseases (malaria, tuberculosis, syphilis, polio) topped the list. 
Other common or serious medical problems (lung cancer, diabetes, 
multiple sclerosis, cirrhosis) came next. At the bottom were things 
such as hangover, senility, heatstroke, tennis elbow, and drowning, 
which had English, not Latin, labels. These authors concluded that 
the features that best predicted the labeling of a condition as a 
disease were that the condition (1) was associated with an 
abnormality of structure or function (i.e., it had a “cause”) and (2) was 
likely to be treated by a doctor. The latter was the stronger 
determinant, but regrettably, this seems tautological. Since doctors are 
in the business of dealing with disease, describing a disease as 
something that doctors deal with does not, in my view, advance our 
understanding much. 

Let us consider the first predictor for a moment. Arguably one 
simplistic but functional view is that if a condition simply 
represents a cluster of signs and symptoms (for example, carpal tunnel 
syndrome, low back pain) it is less disease-like. Presumably this 
reflects a concern that a condition's features and associations among 
the features may be an illusory correlation (which humans are 
particularly good at making) and not “real.” There is good reason for 
such a degree of skepticism. Historically, many syndromes that 
existed 100 years ago, such as self-pollution, have now disappeared, 
and there is every indication that many contemporary syndromes, 
such as chronic fatigue, sick-building syndrome, Gulf War 
syndrome, and the myriad health problems believed to be caused by 
breast implants, may go the same way. Conversely, the ability to 
explain disease through some underlying mechanism lends 
authenticity to it. Angina becomes much more believable if we can find 
narrowing of the lumen of the coronary artery on angiography, even 
though the association with the clinical manifestations is weak. 

The Role of Basic Science 

If we view the identification of the features of a disease as analogous 
to the findings of an experiment (in this case, an experiment 
conducted by a malicious deity) then one basis for distinguishing a 
disease from a non-disease is the extent to which the features can 
be explained by a scientific theory. Thus the infectious diseases are 
explained by a noncontroversial, and historically verified, theory of 
host and parasite. Chronic diseases such as atherosclerosis are a bit 
less disease-like since the theory underlying them is less secure. 
And as we move to syndromes such as chronic fatigue syndrome, 
we are less inclined to view them as diseases because no satisfactory 
scientific mechanism has yet been found to explain their features. 

Turning to clinical reasoning, investigators such as Schmidt and 
Patel, in studying the role of basic science in clinical reasoning, 
have found repeatedly that clinicians rarely invoke mechanistic 
explanations. But as Schmidt has shown, the fact that they need not 
invoke mechanisms does not mean that they do not know them: 
the knowledge is available but is only rarely used. As he describes 
it, the knowledge is “encapsulated.” While basic science may play 
only a minimal role in day-to-day practice, it is arguably the only, 
or at least the major, route to understanding in this domain. Of 
course, basic science need not be restricted to biology. The 
basic science of epidemiology was fundamental to 
understanding the transmission of AIDS, just as Snow in the 1850s 
understood the mechanism of cholera transmission (the London water 
supply) long before the bacillus was isolated. 

I believe we can now posit an explanation for the paradoxical 
findings of Schmidt and Patel. In the normal course of events, 
clinicians making diagnoses deal at the syndrome level, where the 
nature of the causal mechanism is irrelevant. The history and 
physical exam are directed at revealing the syndrome-like 
manifestations, which then point to tests directed at the underlying 
processes, and therapy. The textbooks of clinical diagnosis for “old” 
diseases probably have not changed much since Osler's time. The 
signs and symptoms are pretty well what they have always been, 
although of course some historic scourges (smallpox, diphtheria, 
cholera) are now nearly unheard of in the West, and others, such 
as AIDS, have taken their place. But despite the changes in our 
understanding of disease, the clinician attempting to make a 
diagnosis is dealing almost exclusively at the syndrome level. 
Occasionally, some understanding of underlying processes may help to 
sort out some conundrum, but one suspects that clinicians appear 
rarely to use basic science simply because their investigations of 
history and physical are directed to labeling the syndrome. Clinical 
reasoning reverts to a historically earlier form of the disease, 
following the biologic dictum that ontogeny follows phylogeny: the 
fetus passes through all stages of evolution before birth. 

Campbell elaborated the notion of disease in philosophical 
terms, describing two basic positions: the “nominalist” perspective 
and the “essentialist” perspective. In the nominalist view, a disease 
is simply a collection of abnormalities that appear to arise together. 
Thus the historical diseases of dropsy, consumption, and plague 
were recognized long before any causal agent was detected, 
although etiologies (such as “bad humors”) were advanced. 
Conversely, the essentialist perspective presumes that the signs and 
symptoms arise from pathologic processes that can be identified and 
hopefully rectified. While it is tempting to place these two views 
in a historical order, the contemporary examples we have discussed 
indicate that the two perspectives represent extremes on a 
continuum, which, as we shall see, has parallels in both philosophy and 
psychology. 

What is a Concept? Lessons from Philosophy 

We can make some general observations about the concept of 
disease. First, a disease, like any concept, does not exist entirely “out 
there” but rather, to some degree, is a mental construct. Second, 
the category or concept called “disease” is not an all-or-none 
proposition; rather, particular exemplars have different degrees of 
disease-ness. Finally, it is awfully difficult to devise an explicit rule 
to aid in distinguishing between diseases and non-diseases. A rule 
such as “diseases are what doctors deal with” works quite well but 
is singularly uninformative. And we sense, without proof, that any 
rule we may devise is not going to be coldly analytic, but must 
have sub-rules such as “the more Latinesque it is, the more 
disease-like it is.” So ironically, while it is relatively easy to devise rules to 
determine whether someone has a particular disease (although I 
will go on to show that the rules are not the whole story), it is a 
lot harder to devise rules for the overarching category called 
“disease.” 

These issues are not at all specific to disease, but rather are part 
of a large body of knowledge extending in space across at least three 
disciplines (philosophy, psychology, and neuroscience) and in 
time as far back as Plato. To explore this further, I now venture 
(with considerable trepidation) into a more general inquiry into 
the nature of concepts. I begin by revisiting some philosophical 
views on the nature of concepts. 

The origin of concepts has been, in some sense, a nature-nurture 
debate. However, this argument has focused not on whether 
human traits are inherited or learned (the usual spin on nature versus 
nurture), but rather on whether categories or concepts such as 
beauty, disease, table, or tree exist “out there” to be learned by 
individuals as they develop and mature (which would suggest that 
an individual's knowledge is formed from experience [nurture]) or 
are essentially a product of the mind (we impose order and category 
boundaries where none exists, as a result of the biological structure 
of the mind [nature]). A casual reading of any philosophy textbook 
reveals that this issue has been a central concern through the ages 
of the great minds: Plato, Aristotle, Descartes, Hume, Kant, etc. 
Let us briefly review the historical debate in mainstream 
philosophy, with a view to showing how thinking in philosophy can help 
to frame our perspective on clinical reasoning. 

Modern philosophy began with Descartes, who emerges as the 
ultimate skeptic, and whose views have retained central status as 
the universal straw man for all his successors. His famous statement 
“cogito, ergo sum” (I think, therefore I am) has been a lodestone 
for philosophers and t-shirt makers for three centuries. Regrettably, 
this idea has been almost universally misunderstood. Most interpret 
it as a statement of the ultimate rational man; our humanity is 
defined in terms of our capacity for rational thought. Unfortunately, 
the statement had a much more humble meaning for Descartes. In 
continuing to question whether one could justify any external 
reality, to devise any conclusive argument for the existence of objects 
such as dogs and tables, Descartes was led to the desperate 
conclusion that the only thing he could be really sure of was his own 
thoughts. I think, therefore I am. 

The antithesis of this position was championed by the English 
empiricists Locke and Hume. Their view was that the mind was a 
tabula rasa, a clean slate on which one's experience with the world 
was written. This interpretation seems perfectly acceptable for 
sensory experience, but is more difficult to sustain for higher concepts 
such as causation, temporality, or, for that matter, disease. Hume's 
resolution was to suggest that these notions emerged as a result of 
experience. 

Kant reframed the issue in a way that is central to our subsequent 
journey through psychology and neuroscience. He recognized that 
thoughts can occur only as products of interactions between the 
mind and the external reality of experience; we construct 
experience. He maintained a rigid boundary between those properties that 
our minds bring to experience (which are hardwired) and those 
that emerge from experience. He eventually created a list of 12 
“primitives” (object, causation, temporality, and nine others) 
that he claimed the mind imposed on the world of experience. 

Hegel went one step further and recognized that the external 
world can influence the categories and labels we apply. The 
categories themselves do not emerge from our minds, but are influenced 
by the objects of our perceptions. The mind is not simply a clean 
slate upon which all experience is written in coherent form 
(Hume); nor is it the case that there is no uniform order in the 
outside world and that all concepts are mental inventions 
(Descartes); nor finally does the mind impose fixed structure or 
constructs on sensory experience (Kant). Instead, the concepts and the 
content both grow and evolve (“become”) as a consequence of the 
interaction between the individual and the environment. 

Finally, in this century, Wittgenstein extended these ideas 
further. He proposed that not only are concepts not fixed, they also 
are not definable by any set of logical rules. In pondering even 
commonplace concepts such as “dog,” he realized that any attempt 
to devise rules is doomed. A dog has four legs, but if one is 
amputated it's still a dog. A dog barks, except an Egyptian Basenji. 
A concept, whether an abstract concept such as truth or a 
mundane concept such as dog, fork, or tree, emerges as a matter of 
“family resemblance.” Robins are more bird-like than penguins; 
malaria is more disease-like than alcoholism. Wittgenstein proposed 
that concepts or categories are derived from family resemblances, 
not from fixed sets of defining attributes. 

Thus the philosophy of concepts evolved from a Cartesian view, 
which is entirely intra-psychic and questions any external reality, 
and an empiricist perspective that presumes that all order and 
concepts exist as natural categories to be discovered by the human 
observer, to a Kantian interaction, in which the mind provides the 
categories or concepts and the external reality provides the objects 
to fill the categories, to a Hegelian perspective, which is much more 
organic, and in which thoughts and concepts themselves evolve 
and change as a result of interactions with external reality. 
Ultimately, we reach the perspective of Wittgenstein, which places 
even fewer constraints on concepts, which are a matter of family 
resemblance and thus can be elaborated only through extensive 
experience with the world's families. 

Applying these notions to clinical reasoning, philosophy presents
a larger framework in which to view our dilemma in defining a
disease. To the extent that a disease is a concept, philosophy
buttresses the middle ground between the notion that diseases exist
entirely “out there” only to be discovered and learned and the
notion that they are probably simply mental constructs. We can
then think of the concept of disease as arising from an interaction
between the thoughts of the perceiver and regular aspects and
associations of the environment. Further, some diseases, such as
syphilis, are more central members of the family; others, including
the syndromes, are more peripheral.

As we shall see, this formulation finds remarkable support in
research in both psychology and neuroscience, to which I now turn.

What is a Concept? Lessons from Psychology

One division in psychology has been preoccupied with the same 
issue as the philosophers: how do people learn concepts such as 
table, dog, or truth? But instead of relying entirely on reason for 
understanding, psychology seeks evidence to understand how peo-
ple create and learn concepts. Perhaps in the course of doing so,
psychologists deliberately skirt some of the tough epistemologic is-
sues that preoccupy philosophers. On the other hand, in my own
reading, I was struck by how the one informs the other. A simple 
example: 

The Müller-Lyer illusion, shown in Figure 1, is pretty well
known to all. We see the one vertical element as being longer than 
the other. Even though we can measure them and show them to 
be the same, the illusion is inescapable — a fine example of how we 
impose order (sometimes biased order) on the external world. But 
psychologists have gone further with this illusion, and questioned 
precisely why it is an illusion. In the course of doing so, they pro-
vide a nice illustration of Hegel’s interactive model of mind. One
hypothesis is that it is an illusion because our minds are seeing it 
in three dimensions, so that the symbol on the left is seen as the 
outside corner of a wall nearest the viewer, and the one on the
right is seen as the inside corner of a wall farthest away from
the viewer. Although the two vertical lines are objectively the same
size, since the one on the left is seen to be nearer than the one on
the right, the right one is “actually” longer. Deregowski tested
the illusion in Zulus, who spend their lives in round houses, and 
found that they did not see it as an illusion. So, it is not an illusion 
because our brains are “hardwired” to see it as such (unless Zulus 
have different hardwiring); it is an illusion because of the particular 
experiences we have had with the world. On the other hand, the 
illusion reminds us that our perceptions do not necessarily mirror 
reality, as they are also shaped by internal assumptions (in this case, 
about perspective and the inference of a third dimension from the 
two-dimensional representations on the retina) that sometimes lead 
us astray.

A second example from psychology leads us closer to our central 
concern with clinical reasoning. Most of us have, at one time or 
another, wondered whether the “red” we see is the same as the red 
seen by the person beside us. While the differences in perception
are rarely likely to be as extreme as in the case of a childhood
friend of mine whose color blindness was detected when he went 
to school and repeatedly drew green reindeer at Christmas, we have 
no real way of ever verifying the universality of “red.” Is it just a 
linguistic device, or a cultural norm? After all, at some time we all
had to learn, from our parents or friends, what red was. Perhaps it
differs in different cultures. These questions, as they begin to cross
the boundary between philosophy, psychology, and learning, are of
more than passing interest. 

Much of the fundamental work in concept formation has been
done by Eleanor Rosch. One area she studied was how colors are
identified in different cultures. While, on the one hand, there
appear to be small cultural differences in the boundaries between
colors (e.g., the Navaho have only one word for blue and green;
no wonder, with all that turquoise jewelry around), Rosch
showed that all cultures were unanimous in their choices of the
best examples of red, yellow, or green. Even more interesting, Rosch
discovered a primitive tribe, the Dani, who had words for “bright”
and black only. She then taught them words for colors, using Dani
words (e.g., tree) that were unrelated to color. One group learned
the “primary” colors such as fire-engine red; the other learned Dani
words for intermediate colors such as turquoise. The group learning
red, yellow, and blue learned the associative words rapidly and
effectively; the other group never did master the associations. Studies
of this type provide support for the contemporary notion in
philosophy that categories and concepts derive from our experience of
the world; indeed there is surprising uniformity to these concepts
in precisely those areas where we might expect that experience
(such as the experience of color) is also universal.

Figure 1. Müller-Lyer optical illusion.

Prototype theory was perhaps the first theory of concepts to be 
seriously applied to clinical reasoning. Bordage and Zacks used
many of the methods of Rosch to demonstrate that the same kind
of graded structure that distinguished the natural categories was
present in disease categories. They found, for example, that diabetes 
was a much more prototypical endocrine disease than Hashimoto’s 
disease or hyperthyroidism. It was volunteered more often by prac-
titioners asked to name an endocrine disease, recognized more ac-
curately and quickly, and so on.

These studies lead to two conclusions: first, there is evidence to 
substantiate our musings at the beginning of this talk that the con-
cept of disease is a continuum, not a category. Second, the iden-
tification of conceptual prototypes such as diabetes, carrot, and
robin, which transcend different cultures, argues for an external 
“nurture” basis for concepts — even high-level concepts such as 
disease. 

Prototype theory, in its methods, seeks evidence for cultural or
even transcultural norms for categories. In the extreme, prototype
theory might be viewed as empirical evidence for a position that
concepts and categories are derived entirely from universals in the
environment, a position more extremely nurture-oriented than any 
we have considered except the positions of Locke and Hume. 

Another psychological theory of concept formation, exemplar 
theory, while still holding to the implicit view that the concepts 
we learn reflect an external reality, is much more modest about the 
universality of such concepts. In this perspective, we are able to
identify a member of a class or a concept, not because of any
internal rules or because the sum of our experience has created
prototypes of the class that are available for analysis and introspection,
but because we have, in any category (dogs, chairs, diseases, sports
cars), an innumerable number of instances of the category (my dog,
Rover, Lassie, etc.). When we are faced with a categorization task,
a first line of defense is a search through memory for similar
examples of the class, and then, if we find an example that is
sufficiently similar, we assume the new beast is also a dog. This
description makes the process sound far more deliberate and available for
introspection than the evidence suggests. Instead, if we inquire why
a person decided that the new beast was a golden retriever, the new
car was an Audi, or the skin lesion was actinic keratosis, the modal
response would be “Because it looks like a golden retriever,” or an
Audi or actinic keratosis. Further justification may be forthcoming
but it sounds suspiciously post hoc. This process is in fact unlikely
to be available for conscious introspection.
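
The contrast between the prototype and exemplar accounts can be made
concrete with a small sketch. The Python below is illustrative only: the
two-feature cases, the category labels "A" and "B", and the function names
are assumptions invented for this example, not data or methods from the
studies discussed here. A prototype rule compares a new case with the
average member of each category; an exemplar rule compares it with every
stored instance and adopts the label of the closest one.

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Hypothetical stored cases, two numeric features each (illustrative only).
TRAINING = {
    "A": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (6.0, 6.0)],  # last case is atypical
    "B": [(10.0, 10.0), (9.0, 10.0), (10.0, 9.0)],
}


def centroid(points):
    """Average member of a category: a stand-in for its prototype."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def prototype_classify(item, training):
    """Pick the category whose prototype (centroid) is nearest to the item."""
    return min(training, key=lambda label: dist(item, centroid(training[label])))


def exemplar_classify(item, training):
    """Pick the category of the single most similar stored instance."""
    nearest = min(
        (dist(item, example), label)
        for label, examples in training.items()
        for example in examples
    )
    return nearest[1]


new_case = (6.5, 6.5)  # resembles the atypical member of category A
print(prototype_classify(new_case, TRAINING))
print(exemplar_classify(new_case, TRAINING))
```

With these made-up cases the two rules disagree about the new item: the
prototype rule assigns it to B, while the exemplar rule assigns it to A
because one stored instance of A happens to look very much like it. That
is the kind of dissociation between typicality and similarity that the
experiments described next were designed to expose.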

I and some colleagues have done a series of studies in dermatology
and cardiology in which we have found evidence for this
mode of processing. As one example, in a series of experiments
we gave subjects (residents) practice with a set of dermatology slides
covering 11 conditions, then subsequently tested them with a new
set of slides. The slides were carefully chosen. Each was drawn from
a quartet of slides containing two typical slides that strongly
resembled each other, and two atypical slides that resembled each other.
Each subject was then tested with two other slides of the quartet.
We balanced it all off, so that we could look at performances on
typical-similar, atypical-similar, typical-different, and
atypical-different slides. Thus we deliberately compared typicality (a
property of the number of features and prototype theory) with similarity
(a characteristic of exemplar-based reasoning). The results showed
effects of both similarity and typicality. With immediate testing,
similarity resulted in a gain of accuracy of about 50%, typicality a
gain of about 12%. After ten days’ delay, slides that were similar
to those in the initial learning series were diagnosed about 25%
more accurately, and typical slides were diagnosed about 25% more
accurately.

We have continued to explore these phenomena. One concern 
is that it will work only with visually rich materials, where simi-
larity is highly perceptual. Hatala conducted a study with ECG
interpretation, which, while still visual, is replete with quantitative
rules. In this study, similarity to an ECG in the learning phase was
based entirely on a one-line description (e.g., a “54-year-old ac-
countant” and a “middle-aged banker” versus an “80-year-old
widow”). To demonstrate the effect, the match was to an ECG that
was visually similar, but from an incorrect and confusable category
(e.g., left bundle-branch block and anterior MI). When the de-
scription was matched, accuracy was 23%; when it was unmatched,
it was 46%, and of course more residents who saw the matching 
description fell for the incorrect diagnosis. Further, it would seem 
that the process must have occurred without awareness. If they had 
known they were matching on the age and occupation, they would 
not have done it, since a moment’s introspection reveals that this 
is irrational. 

Both of these psychological theories, prototypes and instances,
derive from a nurture view of concepts, namely that the concepts
we learn are derived from our experiences. In fact, the exemplar
models show precisely how specific experiences are available and
used in subsequent judgments of category membership. However,
as always, there is another side to the story. Psychology has been
equally successful at deriving evidence to support the nature view,
that what we see is influenced by our own minds. Admittedly, this
is not a pure nature view, as we shall see, since the way our
perceptions of the external world are biased derives itself from our
experience with the world.

Cognitive psychology had its origins in an information-process- 
ing model based on the metaphor that the mind is like a computer. 
However, there was rapid accumulation of evidence showing just 
how un-computer-like humans are. One simple yet fundamental 
example is in information retrieval. The answers to questions such 
as “When did Columbus discover America?” and “What is the cap- 
ital of Arkansas?” are available almost as soon as you hear the 
question inflection. Second, if asked about Albania, not Arkansas, 
you would know that you didn’t know almost as rapidly. Contrast 
that with a search of the Web. Although the computer processes
information at least a million times faster than does the mind, 
retrieval will inevitably take much longer. Further, it will take the 
computer longer still to decide that it doesn’t know, since it will
have to search every corner of its memory before it gives up. It is
difficult to envision what kind of memory architecture humans
must have to do this job, but it must be very different from the
computer’s RAM.

One model of memory that accommodates those observations is
called human associative memory. The model emerged from studies
of reading coupled with a phenomenon called the word-superiority
effect, which has relevance, surprisingly, to clinical reasoning as
well as to many other domains. Imagine that I flash a four-letter
word on the computer screen for a few milliseconds and ask you to
identify the fourth letter. The phenomenon is this: when the fourth
letter occurs in a real word such as “rink” or a pseudo-word such
as “Sink,” the “k” is recognized faster and more accurately than
when it occurs in a non-word such as “nrik.” While this seems
perfectly plausible, it says some fundamental things about the
nature of memory. That is, even at the perceptual level of recognizing
individual letters, a process that must occur in milliseconds and
without conscious introspection, identification is facilitated by
memory of much higher-level concepts, the words themselves. This
seems to illustrate beautifully the interactive nature of perception,
showing that what we see can be influenced by what we expect to
see.

The observations of the word-superiority effect were modelled by
McClelland and Rumelhart using a “connectionist” or parallel
distributed processing (PDP) model, with multiple layers of nodes
between input and output corresponding to letter elements, letters,
and words, with links among nodes at all layers. Unlike expert
systems or Bayesian models, these connectionist models had no
pre-programmed rules; rather, they “learned” from experience, gradually
building up strength among certain links connecting nodes.
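
The idea of links that gain strength from experience, with no rule
written in advance, can be illustrated with a toy sketch. The Python below
is not the McClelland and Rumelhart model; it is a single output node
trained with a simple error-correction rule, and the names `train_links`,
`respond`, and `EXAMPLES`, as well as the learning rate, are assumptions
chosen for illustration. The examples happen to encode "respond when either
feature is present," but that rule appears nowhere in the code: it emerges
in the link strengths.

```python
# Hypothetical training examples: (input features, desired response).
EXAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]


def respond(weights, bias, inputs):
    """Output node fires (1) when its summed, weighted input exceeds zero."""
    activation = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation > 0 else 0


def train_links(examples, epochs=20, rate=0.1):
    """Strengthen or weaken links a little after every mistaken response."""
    weights = [0.0] * len(examples[0][0])  # all links start with no strength
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - respond(weights, bias, inputs)
            # Only links from active inputs change, in proportion to the error.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


weights, bias = train_links(EXAMPLES)
for inputs, target in EXAMPLES:
    print(inputs, respond(weights, bias, inputs), target)
```

After a few passes through the examples the node responds correctly to all
four cases. A full PDP model differs in having several layers and links in
both directions, but the principle of learning by adjusting connection
strengths is the same.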

Parallel distributed processing models have been continually re- 
fined (and renamed — they are now more commonly known as 
“neural networks”), and have found application in many settings, 
including clinical diagnosis, where they appear to be more effective
diagnosis machines than the traditional expert systems. However,
for present purposes, these applications are less important than the
observation that the models have commonality with psychological
views of concept formation, based on learning from examples. And
as we shall see, the new name is not simply good public relations 
— neural networks bear a striking resemblance to models emerging 
from neuroscience. 

I and my colleagues have taken the phenomenon that recognition
depends, in part, on available concepts in memory into the
clinical reasoning lab. In a series of studies in dermatology,
radiology, and electrocardiography, we biased the subjects by providing
a brief history suggestive of a particular diagnosis, then showed
them a visual stimulus: an ECG, a slide of a skin lesion, or a
head-and-shoulders picture. We have consistently found that the bias
influences not only the differential diagnosis (which might be
viewed as perfectly rational), but also the feature calls. Moreover,
in a recent study using textbook examples of physical signs, we
showed that it was not simply a case that the history increased
vigilance for that particular sign, and therefore the likelihood of
detection. Rather, an incorrect history led students to misinterpret
one sign as another: the inflamed parotid glands of mumps became
the moon-shaped face of Cushing’s disease, and the moon-shaped
face of Cushing’s became periorbital edema when linked with a
history of nephrotic syndrome.

This phenomenon, that prior higher-level information either 
provided to the subject or available from memory can influence 
basic perceptual processes, has been demonstrated at all levels of 
expertise, from first-year students to cardiologists, so it is not simply
a naive bias that can be erased with experience. LeBlanc’s follow-
up studies of strategies to “de-bias” subjects, under way in our lab,
have shown that even fairly draconian measures are only partially
successful, a finding that is not surprising since perceptual processes
are not available to conscious introspection. 

These findings, both in cognition of perception and in clinical
reasoning, challenge a commonly held view that experts use
“forward reasoning”; that is, they begin with the facts of the case and
reason inductively to a logical conclusion, a view championed by
Groen and Patel. Their findings were derived from verbal
introspections or written summaries, after the subjects had had time to
read and reflect on the clinical case. It is my present view that the
work on top-down processing, both in reading and reasoning,
shows that deductive processes from hypothesized solutions are
already occurring long before the case is in full view, and that the
apparent induction of the expert simply reflects a coherent story
told post hoc. One study done by Eva substantiates this view. He
had subjects read mystery stories, then recount their solutions. Half
told their solutions “online” as they were reading; the other half,
as a summary after. On three measures, the latter group looked as
if they were doing substantially more forward reasoning. However, 
the manipulation took place after the reasoning was over. 

What is a Concept? Lessons from Neuroscience 

Finally, conspicuous in its absence from the discussion to date is 
the role of neuroscience in our understanding of concepts. I have 
described how cognitive psychology has provided examples of phe-
nomena that help us to understand some aspects of clinical reason-
ing. Theories of concept formation and perception are a useful heu-
ristic for teasing apart aspects of clinical reasoning. But the skeptical
reader could be forgiven for remarking that these theories seem 
more like useful demonstrations and analogies than real explana- 
tions, in a scientific sense. 

Let me then venture into what is for me the largely uncharted 
territory of neuroscience. In doing so, I am moving closer to the 
more traditional interpretation of the nature-nurture debate than 
the way I originally framed it. That is, we now seek evidence from
neuroscience that the brain and its structures (nature) are respon- 
sive to, and modified by, the environment (nurture). Further, just 
as basic science provides a framework for understanding disease, 
neuroscience may provide a framework for understanding the pro- 
cess of concept formation and clinical reasoning. 

To advance the neuroscience argument, we need to discover ev-
idence that categories “out there” can be localized to specific brain
activities. Perhaps the most accessible argument about the impact
of specific experiences on brain anatomy and brain development
emerges from the phenomenon of plasticity, the discovery that
there are critical periods in the development of the brain during 
which input from the environment is required in order for specific 
facilities to develop. The phenomenon is ubiquitous. Here are some 
examples: 

■ Children who have congenital cataracts must have them surgi-
cally removed before age 10, or they will be unable to recognize
shape and pattern, although they will be able to learn colors. 
This was hypothesized to arise because of abnormal development 
in the visual cortex. Very recent research with newborns has 
extended this understanding further. Maurer studied children less 
than 9 months old who had had cataracts removed immediately 
following the surgery. Immediately after surgery, their vision was 
like a newborn’s — about 1/40 the acuity of an adult’s. But after 
only one hour of visual input, their acuity had improved to the 
level of a one-month infant. To quote the researcher: “It’s using 
the eyes and having the experience of seeing that’s driving the 
normal experience of vision after birth. . . . the brain was wired
to be ready to receive visual images . . . but it’s got to have the
input in order to do the learning.”

■ Animal experiments showed that kittens raised in an environment
that only allowed horizontal or vertical orientations never
learned the other. Hubel and Wiesel then showed that these
selective deprivations are identifiable in the development of spe-
cific cells in the visual cortex. They went on to show that the
brain development was incredibly specific, so that a single day
of exposure at day 28 was sufficient to establish the orientation.
Other researchers have gone on to establish that plasticity is 
associated with the presence of specific proteins. 

The phenomenon of plasticity is direct evidence of an interac- 
tion between brain structures and the environment, and provides 
an explanation for the philosophical dilemma. Of course, such ex- 
periments do not provide direct evidence that higher-order con- 
cepts such as temperature, unemployment, love, or for that matter, 
tables, are associated with specific local changes. The next step is 
to move from the construction of perceptual maps of the
environment to conceptual maps in different areas of the brain. This may
not be as large a leap as it sounds; after all, the mechanism that
enables us to recognize Aunt Sally must involve links among the
more primitive operators that isolate color, shape, and orientation.
Thus, we move from brain mappings corresponding to perceptual
inputs, which, as we have seen, develop and specialize as a
consequence of interactions with the environment at highly specific
developmental intervals, to mappings corresponding to the
relations among these elements, a “mapping of types of maps,”
according to Edelman. This remains a theory thus far, the theory of
“neuronal group selection.” I cannot pretend to be more than an 
intrigued observer, but it would seem that the evidence at hand 
regarding neural plasticity provides plausible mechanisms for such 
a neural correlate of concept formation. Indeed, as I discussed ear- 
lier, although neural networks were devised as a simulation device 
to test a model of concept learning involving parallel and distrib- 
uted activation, there is a striking correspondence between the 
nodes and connections of neural networks and the proposed model 
of neuronal group selection. 

Conclusions 

This review was intended to accomplish no more than to place the 
current debates around clinical reasoning in a larger context. There 
is, in all this, a Michigan State University (MSU) connection. The 
small research program focusing on clinical reasoning was begun by 
Elstein and Shulman at MSU in the early 1970s. The McMaster 
group joined the fray soon after, with me as their hired hand. But 
soon after this first cycle of studies was completed, there was a 
strong divergence in the field. Elstein moved his interest to nor- 
mative approaches such as decision analysis, assuming that clini- 
cians were suboptimal decision makers who could be made more 
optimal with training. Others who followed, including Patel and 
Groen, while disagreeing on the details, retained a strongly ration- 
alist perspective. On the other side, Bordage pursued studies in 
prototype theory, and I began a research program around exemplar
models. It is only recently, with the study leading to this review, 
that I began to appreciate the historical origins of this divergence. 

The exciting conclusion from this review is that there appears 
to be a convergence among the three disciplines — philosophy, psy- 
chology and neuroscience — pointing to the reconciliation of these 
positions. While the constructs, the capacities for identifying reg- 
ularities appear innate, these abilities are directly responsive to the 
environment, so that each individual’s concepts will be both com-
munal and idiosyncratic. Moreover, this synthesis has some prac-
tical implications (believe it or not!). It appears to me that these
thinkers are urging us to a reconciliation in our own field — exper- 
tise in clinical reasoning is neither mastery of analytical rules nor 
accumulation of experience, it is both. And the role of experience 
with individual examples in refining the concepts is critical. More- 
over, the philosophical work and the demonstrations of optical il- 
lusions show us that the external environment is not delivered to 
the senses intact, but is filtered through the prisms of prior expe- 
rience. These are important lessons for instruction in clinical rea- 
soning. 

The sum of these findings describes a model of clinical reasoning 
very different from the algorithmic processes used by the computer
(except when, using neural networks, the computer mirrors the 
mind). An evident implication is that there is little to be gained 
in demonstrating that humans are suboptimal Bayesians or algo-
rithm-appliers; they are suboptimal because they are using a sub-
stantially different basis for computation. While, on the one hand,
this provides a strong rationale for computerized decision-support 
systems, the cautionary note that pervades this review is that the 
support system cannot intervene after the data are collected, since
the data are themselves subject to interpretation in light of mental 
models. 

The Jack Maatsch Memorial Presentation was sponsored by the Office of Medical
Education Research and Development, Michigan State University, and presented at
the annual AAMC-RIME meeting, October 27, 1999.

1999 JACK MAATSCH MEMORIAL PRESENTATION:
RESPONSE



Clinical Problem Solving and Decision Psychology: 
Comment on “The Epistemology of Clinical Reasoning”

ARTHUR S. ELSTEIN 



Geoff Norman has presented an extremely rich and stimulating 
paper that surveys many important themes. In my response to his 
article, I shall not comment on the connections he seeks between
psychology, philosophy, and neuroscience, because this attempted 
synthesis is well beyond my area of expertise. Instead, my discussion 
focuses on two other issues: the status of research on the psychology 
of clinical problem solving, and the connections between this re-
search and decision psychology, the framework in which I have
worked for the last 20 years. Then I consider the implications of 
this work for improving the quality of health care decisions. 

Status of Research on the Psychology of Clinical 
Problem Solving 

Several schemes have been put forth to explain how diagnostic 
reasoning is accomplished, including diagnostic categorization by 
instance-based recognition, prototypes, propositional networks,
forward reasoning or pattern matching, and generating competing
hypotheses. Evidence supporting each of these models is available
in the literature. How can this be? Norman argues that no single
representation of the process or of the organization of knowledge
accounts for all of the phenomena investigators have encountered. 
Each account is correct sometimes, because individuals adapt their 
strategies to the demands of the task, including the demands of the 
experimenter. This implies that experiments designed to test par- 
ticular hypotheses have also, in some sense, been designed to val- 
idate the hypotheses or beliefs of the investigators. 

Norman and I agree that problem solvers are adaptive creatures, 
and we must be careful about concluding that any one account of 
their behavior will explain all phenomena. He and his collaborator,
Henk Schmidt, put it well: “There is more than one way to solve
a problem.” Viewing problem solvers as adaptive thinkers trying
to cope with complexity does not attribute malicious intent either
to investigators or to research subjects. On the contrary, it harks
back to Newell and Simon, who argued that because of the lim-
itations of working memory, complex tasks are represented in sim-
plified problem spaces, and that consequently understanding prob-
lem solving is significantly advanced by understanding that
cognitive representation. Their view was quite radical for its time,
for the concept of a problem space really committed us to the study
of what we now call problem representations or mental models.

Different mental models might be employed by different subjects, 
or the choice might depend on the task. It follows that a hierar-
chical organization of medical knowledge, with general concepts at
the top and specific instances at the bottom, is a plausible repre- 
sentation and is partially correct. So are propositional networks, 
with their nodes and connections, symptom-by-disease matrices, 
and semantic networks. In most studies employing each of these 
frameworks, the model finds reasonable support in the data. Nor- 
man argues that this fit occurs because the subjects, whether med- 
ical students, residents, or more experienced physicians, figure out 
how to adapt to the demands of the task, and these demands usually 
ask them to behave in ways that provide evidence for the models. 



This view owes much to Rosenthal’s research on demand charac-
teristics. Within the domain of cognitive studies of medical rea-
soning, I am not aware of studies that test the fits of different
cognitive models to the same set of data, so we do not know which 
would fit the data best or how often each model is used. Studies to 
test competing models can and should be designed. 

Several prominent investigators in the field of medical cognition 
have used verbal reports of subjects thinking aloud either while 
solving a diagnostic problem or retrospectively to construct repre-
sentations of the problem-solving process. Norman notes that
“propositional networks are disturbingly idiosyncratic and not ap-
parently reproducible.” I cannot entirely endorse his view that “all
of these concept architectures are produced on the fly at retrieval,
in order to satisfy the expectations of the researcher.” It is at least
plausible that these “architectures,” like other blueprints, are plans
for a constructive process; if one follows a blueprint and a house
or office building results, we should not be surprised. The plan was 
designed to lead to that output. 

Still, his caution is warranted. We should not unhesitatingly em- 
brace verbal reports as the solution to the problem of elucidating 
cognitive processes. Too much cognitive processing goes on be-
neath the level of verbal report. And we agree that, to the extent
that subjects adapt to the demands of the experimenter, they are 
likely to tell us what they think we want to hear. These objections 
imply that research that relies on verbal reports for basic data is 
not as likely to lead to “truth” as we would like to believe, and 
that we should move away from thinking-aloud methods back to 
traditional experimental psychology: the researcher should observe 
the relationship between the stimulus and the subject’s response, 
and ignore or distrust verbalizations about the task. The subject’s
response may be verbal, such as a diagnosis or a probability esti-
mate, but a scientific explanation of the thought process should not
be based on responses to such questions as “How did you know 
that?” or “Why do you think this is so?” If these questions are used, 
we should treat the explanations and justifications as data, not as 
true accounts of the operations of the subjects’ minds. 

Both Norman and I have taken these cautionary thoughts to 
heart over the years. Consequently, we have moved away from 
thinking-aloud accounts as the primary data source and toward
more traditional experimental methods (for examples, see refer-
ences 12 and 13). We have done this despite knowing that ex-
perimental studies will be criticized by clinicians on the grounds
that they lack clinical verisimilitude and may not generalize to real 
clinical settings. A thoughtful clinician will surely ask this question 
about our work: “Even if I concede that physicians behave as you 
have shown in this experimental setting, what reason is there to 
believe that they would behave similarly when dealing with real 
patients?” Anticipating this question, Norman and his colleagues
have worked extensively with visual stimuli, such as radiographs
and ECG tracings, that are unquestionably part of the real clinical
world. But this strategy begs the question, “Do the results apply
to non-visual stimuli, such as are obtained in taking a good history?”
My colleagues and I have done some research using case
vignettes that do not use thinking aloud to study clinical reasoning.
One objection raised to our findings in those studies relates
to motivational factors: clinicians are not motivated to do their 
best with hypothetical cases and would do “better” with real pa- 
tients. I think it unlikely that clinical problem solving will be better 
in complex environments, with many distractions, than in simpli- 
fied laboratory settings, but I concede that, just as with pharma- 
ceutical research, laboratory findings should be verified in the “real 
world.” Nobody ever said that doing good research would be easy. 

Decision Psychology 

Norman noted that my own research program moved in a different 
direction after 1980, from a focus on clinical “problem solving” to 
“decision making.” What is the difference? For over two decades,
much of the research on the psychology of decision making has
been dominated by statistical decision theory, a model of idealized
rationality under uncertainty. Behavioral decision research has concentrated
on identifying systematic departures from this model, and
these departures are viewed as “errors.” The research has shown
that while decision theory may be an account of ideal rationality, 
it is not a description of how people actually make judgments and 
choices under uncertainty. In short, limited rationality has its im- 
pact on both decision making and problem solving. The psycho- 
logical processes that produce these errors are called “heuristics and 
biases.” Indeed, the entire line of research has come to be identified
by this term.

Norman argues that there is not much point in identifying cog- 
nitive heuristics and biases that violate the rules of statistical de- 
cision theory, since people are not trying to reach conclusions using 
these principles. To quote: “An evident implication is that there is 
little to be gained in demonstrating that humans are suboptimal
Bayesians or algorithm-appliers; they are suboptimal because they
are using a substantially different basis for computation.”

In my judgment, Norman has misunderstood the research agenda 
of decision psychology and its implications for medical education. 
The study of clinical diagnostic reasoning from the problem-solving 
point of view implies one thinks of diagnosis as categorization. The 
research questions then center around issues such as, “What categories
does the problem solver know? What features justify placing
the case in one category or another?” These are questions about the 
knowledge base and feature recognition and interpretation. From 
the decision-making standpoint, clinical diagnosis is opinion revi- 
sion with imperfect information, and treatment choice is about how 
best to balance benefits and harms. Risk and uncertainty are everywhere.
The aims of the research are to identify the processes
people use in making complex judgments and choices under these 
conditions, and to ask whether their behaviors are consistent with 
Bayes’ theorem (for diagnostic reasoning) and maximizing expected
utility (for treatment choices). If behavior is not consistent with
these principles, and if we find these principles sensible and appealing,
we might well wonder what kind of educational program
could be developed to improve our decision making. Therefore,
there is just as much point to studying cognitive heuristics and
biases as there is to studying the roles of instances and prototypes
in categorization. Indeed, the role of instances in categorization can
be seen as a special case of base-rate neglect or of treating irrelevant
data as strong evidence: in reality, some of the cues associated with
the instance have likelihood ratios close to 1.0 (the decision-theoretic
definition of irrelevant), but are treated as if they are meaningful,
say >10.0. Using two very different theoretical frameworks,
both of us have thrown some light on how the mind works, and
we have shown that human inference can be improved upon. To
improve clinical decision making, it seems to me that decision theory
is at least as promising as the study of categorization processes.
I still think that a general strategy applicable to a wide range of
clinical situations would be very useful in helping people to think
straight. Norman referred to the finding of content specificity, discovered
in my early research in this area. Given this fact, the need
for a general approach to sound thinking is even greater than we
had previously suspected.
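
The point about likelihood ratios can be made concrete with the odds form of Bayes' theorem. The following is only an illustrative sketch: the function name, the 5% prior, and the assumption that cues are conditionally independent are all hypothetical choices made for the example, not values taken from any study discussed here.

```python
from __future__ import annotations


def posterior_probability(prior: float, likelihood_ratios: list[float]) -> float:
    """Revise a prior probability of disease using the odds form of Bayes' theorem.

    Assumes (for illustration only) that the cues are conditionally independent,
    so their likelihood ratios can simply be multiplied together.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    odds = prior / (1.0 - prior)      # convert probability to odds
    for lr in likelihood_ratios:
        odds *= lr                    # each cue multiplies the odds by its LR
    return odds / (1.0 + odds)        # convert odds back to probability


# Hypothetical numbers, chosen only to illustrate the argument.
prior = 0.05
irrelevant_cue = posterior_probability(prior, [1.0])     # LR = 1.0: opinion should not move
overweighted_cue = posterior_probability(prior, [10.0])  # same cue misread as LR = 10.0
```

With these made-up inputs, a cue whose likelihood ratio is 1.0 leaves the probability at the prior, while treating the same cue as if its ratio were 10.0 raises the estimate several-fold. That gap is what the decision-theoretic framing calls treating irrelevant data as strong evidence.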

What is the evidence that clinicians at times need help in think- 
ing about complex problems? Two related bodies of evidence, from 
cognitive psychology and from health services research, support this 
claim. From cognitive psychology, we have a series of lessons and 
findings about limited rationality. Health services researchers have 
provided a growing body of literature on practice variation (for 
example, see references 21 and 22), which has repeatedly shown 
that something besides hard science is involved in many medical 
decisions, both diagnostic and therapeutic, and that these varia- 
tions are not necessarily rational responses to differences between 
patients. 

How Can We Improve the Quality of Clinical Practice? 

Interestingly, in the past 20 years, two related decision technologies 
have arisen that deal precisely with these issues: evidence-based 
medicine (EBM) and decision analysis (DA). Both offer to the 
medical community ways of quantifying the evidence, dealing with 
uncertainty and error in the evidence, and trying to systematically
weigh risks and benefits of alternative treatment strategies. The
rapid dissemination of these principles may be attributed in part to
the diligence and enthusiasm of their devotees, but it cannot be 
entirely explained by their efforts. The Zeitgeist or cultural climate 
had to be ready. In my view, psychological research on problem 
solving and decision making has contributed to these developments 
by showing that expert clinical judgment was not as expert as we 
had believed it to be, that knowledge transfer was more limited 
than we had hoped it would be, and that judgmental errors were 
neither limited to medical students nor eradicated by experience. 
EBM and DA offer approaches for dealing with these problems, 
and that is why they are making headway in clinical medicine. 
Clinical practice guidelines, which are intended to improve the
overall quality of care, are another, related, approach to these issues,
and the problems encountered in their dissemination and implementation
have been widely discussed.

The reactions to these approaches suggest that the tension be- 
tween theory and practice will remain. All theories and models are 
simplifications of reality. They abstract particular features in order 
to provide a reasonably coherent account of how things work and 
to guide action. That is precisely why they are useful. Models are 
not reality, however, and theory is not practice. Consequently, phy- 
sicians often mistrust the adequacy of scientific accounts or guide- 
lines based on evidence, despite the necessity of relying upon them. 
Because general principles will never be able to account for all
concerns in clinical cases, there will always be room for judgment,
applying general principles on a case-by-case basis.

Encomium: Let Us Now Praise . . . 

Geoff left his comments about the connection to Michigan State
University (MSU) and its College of Human Medicine for the close
of his remarks, and I follow his example. How fortunate that many
years ago, Geoff Norman came to MSU and joined our small group
of scholars. We were not aware that we were doing classic work
that would be argued and discussed and revisited for a generation.
Who could possibly have thought that? Yet, if there was ever a
golden era of research in medical education, it was there and then.
We have made some progress, and we have had a wonderful run.
When I think of that medical school and its faculty and students
back in the 70s, the wonderful line from Shakespeare’s Henry V
always comes to mind: “We few, we happy few, we band of
brothers.” How appropriate that in Jack Maatsch’s memory we have
come together to discuss some issues that concerned him and to
celebrate that happy band!
1999 INVITED ADDRESS



The Marvelous Medical Education Machine or How Medical Education Can Be Unstuck in Time 



CHARLES P. FRIEDMAN



The jumping-off point for this paper is actually the second part of 
its compound title. The concept of becoming “unstuck” in time
stems from the initial line of Kurt Vonnegut’s popular novel
Slaughterhouse-Five [Vonnegut 1969]:

Listen: Billy Pilgrim has come unstuck in time. 

In this paper I will actually argue that medical education has become
“stuck,” not only in time but also in space and content. It
has become stuck in time because events considered to be educa- 
tional largely occur through interactions that require the learners 
and the faculty to be simultaneously participating in these interactions.
It has become stuck in space because its mechanisms of
delivery are largely bound to a specific physical location, the academic
medical center with its classrooms and associated health care
delivery venues. It has become stuck in content because the topics
that are the focus of educational interactions are insufficiently under
the control of the students and the teachers. Increasingly, there
is no reason for any of these requirements to be imposed on the 
educational process. Moreover, medical education remains stuck in 
an era when much of the rest of human enterprise is becoming 
unstuck, the result of a sweeping set of cultural changes made pos- 
sible by information technology and primarily by the phenomenal 
proliferation of the global Internet [Drucker 1999]. 

I will further argue in this paper that medical education can 
gradually be “unstuck” in space, time, and content through appro- 
priate use of emerging technology, with emphasis on simulation 
methods that have become widespread in the training of pilots
and professionals in other disciplines. Modern flight simulators
have become so sophisticated that experienced pilots being certified
to fly a new aircraft might have a load of passengers in the back
the first time they actually fly the plane [Dawson and Kaufman
1998]. While there will always be a pilot experienced in flying this
aircraft alongside the neophyte in the cockpit, this practice clearly
testifies to the educational power of simulations. Recently, the U.S.
Navy adopted the inexpensive Microsoft “Flight Simulator”® program
as standard training for its new pilots, after a trainee who
practiced extensively on this program recorded the best performance
ever on an initial training flight [Brewin 2000].

The “marvelous medical education machine,” as the concept will 
be developed in this paper, is the complete simulator for medical 
education, analogous to the best of contemporary flight simulators. 
But like Vonnegut’s novel, the marvelous machine is currently a
work of fiction. It does not exist, although bits and pieces of it do
exist, and these suggest what might be possible in the not-too-distant
future. In the sections that follow, I will describe the need
for the marvelous machine in greater detail, discuss what it can
potentially do when built, expose the internal anatomy of the complete
machine, review some of the pieces that exist now and how
we might build it from here, and finally discuss some of the key
educational research questions that will have to be illuminated 
along the way. This paper, in its entirety, will argue that building 
the marvelous machine should be a top priority for medical edu- 
cation nationally and internationally. 

Stuck in Space, Time, and Content 

To clarify what it means for medical education to be “stuck,” it 
may be useful to consider education as a process with events that 
exist in three dimensions (Figure 1). The first dimension can be 
thought of as physical space, the second time, and the third the 
biomedical topic that is under consideration. Medical education is 
stuck in all three dimensions, because teachers and learners have 
little control over these dimensions: where and when the events 
occur and what topics are addressed. In the basic sciences, for ex- 
ample, lectures and labs occur in a fixed place and at a scheduled 
time and on a topic that faculty believes the students need to know 
about — and then they are over. In the clinical sciences, patients 
(who remain the primary “teaching material” even though this 
term is seldom used anymore) appear at a fixed location and at a 
particular time with the problem they happen to have — and then 
they leave. 

This way of doing educational business is so much a part of daily 
life in an academic medical center that most of us take it for 
granted; and since our students learn and graduate and become 
certified as practitioners, it is easy to conclude that there is nothing 
wrong with being “stuck.” But there are profound reasons for con- 
cern. First and foremost, education that is stuck routinely ignores 
much of what is known about teaching and learning in medicine. 
Studies of clinical reasoning accumulated over more than 20 years 
point to the “case specificity” of medical expertise, meaning that 
proficiency generalizes very weakly from disease to disease and, 
more generally, from one aspect of medicine to another [Elstein et
al. 1978; Schmidt et al. 1990]. As such, the effective way,
and perhaps the only way, of developing proficiency over time is
active practice with a wide range of cases and with as many repetitions
for each subject/disease area as possible [Issenberg et al.
1999]. In educational environments that are stuck, live patients are
the primary source of such practice; yet faculty and students have
no control over the patients who walk into the clinic or are admitted
to the hospital. Active, appropriate practice under these
circumstances can be very difficult to engineer, much less guarantee.

Another problem is the expectations of a coming generation of
learners that has increasingly “grown up digital” [Tapscott 1998].
Our students who have experienced increasingly sophisticated
video games, and who have spent hours with such excellent simulations
as SimCity® and Flight Simulator®, will recognize immediately
the potential for similar experiences to enhance their
training in medical domains. These learners will intuitively understand
that medical education is stuck in space, time, and content.
Although they may not use these exact words, they will find being
stuck unacceptable. They may articulate this recognition by comparing
their medical education experience with their undergraduate
experience, wondering why, as the sophistication level of what they
are studying is increasing, the sophistication of the technology used
to support these studies is decreasing. In the short term, they may
accept what they see as antediluvian educational practices, simply
because these represent the only pathway to a desired profession,
but over time they will demand a different kind of service, the
need for which and the practicality of which they see as self-evident.
If they cannot get this service from traditional educational
institutions, their instincts honed by the Internet culture will lead
them to seek it from other sources.

Economic pressures on academic medical centers may drive 
change as well. The problem of providing appropriate practice for
trainees is exacerbated as health care economics shortens hospital
stays and clinic visits, and trainees necessarily have more limited
access to patients. Clinical faculty members at academic medical
centers and in community settings may perceive that their productivity
is judged much more by patient throughput than student
learning. An educational system already limited in its ability to
provide an appropriate range of “teaching material” may find itself 
unable to provide appropriately motivated teachers as well. 

If academic medical centers do not systematically recognize the
opportunity afforded by information technology to “unstick” the
system, others will. Hafferty has warned that, for a variety of reasons,
medical education based in academic centers could lose its
social mandate by not addressing in the curriculum a widely recognized
set of social needs, and thus become irrelevant to the needs
of the modern world [Hafferty 1999]. Similarly, by remaining obstinately
stuck in space, time, and content, academic medical centers
could lose what may be called their “technical mandate” to
educate because the methods being used no longer make sense to
trainees and to society as a whole. Simultaneous loss of social and
technical mandates will generate alternative approaches to education
that could, over time, become the norm. Such alternatives are
already becoming evident, for example, in the Open University’s
plan to offer a curriculum equivalent to the first two years of the
medical curriculum in the United Kingdom [Daniel 1999], and possibly
through Internet ventures such as “medschool.com”®
[Medschool.com 2000]. Established academic medical centers can
choose to be leaders and active partners in these developments, or
not.

Some may ask to what extent the technique of standardized or
simulated live patients [Ainsworth et al. 1991], which has occupied
much of the attention of the medical education research community
over the past two decades, offers the capabilities of the marvelous
machine. It does, but as a practical matter only to a very
limited extent. Standardized patients are expensive and do not offer
the economies of scale that, as will be discussed, the marvelous
machine so profoundly offers. The largest expense associated with
use of standardized patients is the wages they must be paid, and the
20th standardized patient encountered by a student costs almost as
much as the first. Standardized patients must be painstakingly
trained, and there are significant costs associated with this training
that are completely lost once the patient retires from active educational
service. And a standardized patient can offer only limited
variations on the case he/she was trained to represent. As a trainer
for procedures, standardized patients must endure the mistakes of
the non-expert. Invasive or risky procedures cannot ethically be
performed on them at all. Although they can explain how they
feel, standardized patients have no access to what is actually happening
inside their bodies, and cannot explain to trainees the consequences
of their actions at the organic or cellular levels. Finally,
standardized patients cannot easily record what is being done to
them by the trainee, so feedback to trainees cannot be related with
high precision directly to their actual actions and decisions. So
while standardized patients can be enormously valuable sources of
practice and tools for assessment, they take medical education only
part of the way to where it can and needs to go. They are, for the
most part, stuck in space, time, and content.

Potential of the Marvelous Machine 

Remember above all that the marvelous machine does not currently
exist. As we consider what the future might hold if medical education
begins a steady progression toward the development of the
marvelous machine, it is useful to visualize an end-point of this
progression. I do not envision, ever, the complete elimination of
teaching around live patients in the same sense, although some
might disagree, that no novice pilot is likely to receive a license
without flying a real plane. Nor does this work envision that neurosurgery
residents will perform their first operation solo after five
years of practice only on a simulator. I do, however, envision a
future where medical trainees, and practitioners for their continuing
education, spend increasingly large fractions of their time working
on computer-based simulators. The reasons for this are examined
below. Later sections explore what must be inside such a
machine in order for it to do these things.

The marvelous machine is unstuck in the three-dimensional educational
space described earlier (see Figure 1) because it can provide
tireless practice of medical diagnosis, management, and clinical
procedures. It is unstuck in the time dimension because it can
be used anytime, for as long as the trainee wishes, and over and
over again to provide the kind of meaningful repetition of tasks
that is highly desirable. The machine is unstuck in the space dimension
because the ideal, fully developed machine can “go” or be
accessed anywhere. The Internet can in principle bring the capabilities
of the machine to a trainee at home or on campus, anywhere
in the world. This capability has enormous implications for
the future of medical education as it requires us to think of the
medical school not so much as a physical place but as a set of
learning resources that can be delivered anywhere [Friedman 1996].
The machine is unstuck in the content dimension because it can
address on demand topics and skills of faculty and/or student
choice, creating appropriate variants of each case or topic to enable
meaningful practice to occur. It can record every element of what
happened during a student’s work with a case, generating highly
specific feedback to the learner on his/her performance and informing
student and faculty choices about what further practice each
student may need.

A further key feature of the marvelous machine is the fortunate
economics of its use. Once developed and programmed, there is
minimal marginal cost attaching to its operation. The contrast to
standardized patients, who are paid a fixed sum per hour, is particularly
striking in this regard.

Figure 2. The anatomy of the marvelous machine.

To understand the potential of the machine from a somewhat
different perspective, consider the potential of such a device to
engage learners in “what if” games, which are enormously educational.
Students can ask, and get answers from the machine, to the
following classes of “what if” questions:

■ What if I did the procedure again, just a little bit differently? The
marvelous machine allows students to tinker in a way that enables
them to hone their skills and judgment. A student can, for
example, explore the consequences of perhaps giving a slightly
stronger dose of a drug to a “patient” whose disease is being
simulated by the machine.

■ What if I did this in a way I know is wrong? Without the machine,
it is difficult to experience the consequences of mistakes
as a way to learn to manage them. In the real clinical world,
mistakes certainly cannot be purposely made, and when they
occur occasionally by accident or oversight they are often
not recognized as such until long after their occurrence. With
the machine, students can make mistakes on purpose, knowing
that they are mistakes, so they can practice managing the consequences,
or just to see what happens.

■ What if I did this 100 times in each of two different ways? One
of the most educationally creative ways of using the marvelous
machine may be to conduct an “instant clinical trial” by instructing
the machine to treat 100 instances of the “patient” one
way and another 100 instances a different way. The models built
into the mature machine, as will be discussed below, are necessarily
and realistically probabilistic and the machine will therefore
reflect naturally occurring variability in the way organisms
respond to drugs and other external stimuli.

■ What if biology worked just a bit differently? Used in this way,
the machine can connect the basic and clinical sciences. In a
fully mature version of the machine, students can be given the
capability of changing the parameters of the biological models
that drive the simulator. Their understanding of basic biology
can be significantly enhanced through the
ability to see how organisms would act or react if the basic laws
of biology were constructed just a bit differently from the way
we believe they are.
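
The “instant clinical trial” idea in the third question above can be sketched in a few lines. This is a toy illustration only: the function names are hypothetical, and a single response probability per treatment stands in for the far richer probabilistic models a mature machine would contain.

```python
import random


def simulate_patient(response_probability: float, rng: random.Random) -> bool:
    """Return True if one simulated 'patient' responds to treatment.

    A single probability is a deliberately crude stand-in for a real
    domain model (an assumption made for this sketch).
    """
    return rng.random() < response_probability


def instant_trial(p_a: float, p_b: float, n: int = 100, seed: int = 0) -> tuple:
    """Treat n simulated patients each way and count responders in each arm."""
    rng = random.Random(seed)  # seeded so a student can replay the same trial
    responders_a = sum(simulate_patient(p_a, rng) for _ in range(n))
    responders_b = sum(simulate_patient(p_b, rng) for _ in range(n))
    return responders_a, responders_b


# Hypothetical response rates for two ways of treating the same "patient".
arm_a, arm_b = instant_trial(p_a=0.30, p_b=0.50)
```

Because each simulated patient responds by chance, repeating the trial with a different seed gives somewhat different counts, which is exactly the naturally occurring variability the learner is meant to see.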

Based on this, for now, somewhat abstract conceptualization of 
the marvelous machine and using our imaginations to conceive 
what someday the machine will be able to do, consider how medical
education must then be undertaken. Medical education would
not look and feel at all as it does now. The rationale for “lockstep”
learning wherein students proceed in unison through a relatively
rigid curriculum would disappear almost completely, and likely with
it would disappear the notion of a four-year curriculum. Indeed,
lockstep learning can be seen as an administrative artifact of the
lack of a mature marvelous machine. Students could, in principle,
begin their predoctoral education whenever they were ready, and
authorized, to do so. They would end it when they had proved they
had mastered the stated objectives of the curriculum. There might
be no need to have students physically on the central campus most,
or even some, of the time. Lectures certainly have their place as
an educational medium, but the current reliance on lectures as a
primary mechanism for conveying information would no longer
make sense. Perhaps, with the marvelous machine, we could return
to the pre-Flexnerian concept of the part-time student without inheriting
the educational inadequacies of the pre-Flexner era. While
this paper does not focus on continuing education of physicians
and other health professionals, the needs in continuing education
are such that the potential effects of the machine on this level of
the educational continuum are similarly revolutionary [Barnes
1998].

The Anatomy of the Marvelous Machine 

Now to more technical specifics. How are we going to build the
machine? What is its anatomy [van Meurs et al. 1997], its necessary
component parts?

As illustrated in Figure 2, the marvelous machine can be seen
as having five major components, not counting the “learner” without
whom the machine would have no purpose. The specific techniques
for developing each of these components are beyond the
scope of this paper, but a later section discusses the academic disciplines
that contribute to each one.

■ First and foremost, the machine has a domain model, which is
a mathematical description of the biological phenomena governing
the disease or body sub-system of interest. The domain model
computes the state of the patient and the effects of the learner’s
actions on the state of the patient. The mathematical domain
model is what makes it possible for the marvelous machine to
generate an endless supply of novel cases and other practice opportunities,
and it is what largely sets the marvelous machine
apart from traditional simulation environments that use
“scripted” cases. Typically, these domain models have explicit
probabilistic features that reflect the natural variability in disease
development and response to clinical interventions.

■ Next is the clinical representation engine. This component is
necessary because the output of the domain model is typically a
set of numbers that must be translated into clinical observables:
statements the patient would make about his/her disease (“I feel
tired all the time . . .”), findings that could be appreciated on
physical examination (“The patient is cyanotic . . .”), and test
results (“Biopsy reveals a tumor . . .”).

■ The sensory pathways component takes the findings and creates 
portrayals of them that are actually seen, heard, or touched by 
the learner. This component can be seen as the virtual reality 
aspect of the machine [Hoffman and Vu 1997; Satava and Jones
1998]. In a mature version of the machine, the learner will see
the patient and hear his/her statements, and experience his/her
physical condition through sight, touch, and hearing. All of 
these presentations would change as the patient’s condition 
changed, as directed by the domain model in response to actions 
taken by the learner and/or a natural evolution of the patient 
condition. The changes might occur in real time, as would be 
the case if the learner was using the machine to practice a pro- 
cedure, or in compressed (simulated) time if the learner was us- 
ing the machine to practice longitudinal management of a 
chronic disease. 

■ The scoring model is the basis of providing performance feed- 
back to the learner. Although there are other ways of approach- 
ing this problem, the scoring model typically would compute 
what is the ideal action for the learner to take at any point in 
the simulation and compute an instantaneous “score” for the learner
through a metric that compares the ideal performance with what 
the learner actually did. In some versions of the machine, the 
knowledge encoded in the domain model can also be harnessed 
to power the scoring model. 

■ Lastly, a complete educational application using the machine 
must have a curriculum model. Since the machine’s domain 
model can support learners’ practice by constructing cases with 
specific problems and other characteristics, the curriculum model 
would represent the set of problems and characteristics on which 
all learners must have practice, and in which order. For each 
learner, the curriculum model would maintain records of which
aspects of practice had actually occurred. 
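
The components listed above can be pictured as a handful of cooperating pieces of software. The sketch below is purely illustrative and makes strong simplifying assumptions: every class, function, and number in it is hypothetical, the “biology” is a one-variable toy, and the sensory pathways component is left out entirely because it concerns presentation rather than computation.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field


@dataclass
class PatientState:
    tumor_burden: float  # arbitrary units; a hypothetical state variable


class DomainModel:
    """Toy probabilistic domain model; growth and kill rates are invented."""

    def __init__(self, seed: int = 0) -> None:
        self.rng = random.Random(seed)

    def new_case(self) -> PatientState:
        # Chance determines the starting state, so no two cases need match.
        return PatientState(tumor_burden=self.rng.uniform(50.0, 100.0))

    def step(self, state: PatientState, dose: float) -> PatientState:
        growth = self.rng.uniform(1.05, 1.15)   # chance variation in tumor growth
        kill_fraction = min(0.9, 0.1 * dose)    # larger dose removes more, capped at 90%
        return PatientState(state.tumor_burden * growth * (1.0 - kill_fraction))


def clinical_representation(state: PatientState) -> str:
    """Clinical representation engine: turn model numbers into an observable."""
    if state.tumor_burden > 60.0:
        return "imaging shows extensive disease"
    return "imaging shows limited disease"


def score(ideal_dose: float, chosen_dose: float) -> float:
    """Scoring model: 1.0 for the ideal action, falling to 0 as the choice departs.

    Assumes ideal_dose > 0.
    """
    return max(0.0, 1.0 - abs(ideal_dose - chosen_dose) / ideal_dose)


@dataclass
class CurriculumModel:
    required: dict[str, int]                    # topic -> minimum number of practice cases
    completed: dict[str, int] = field(default_factory=dict)

    def next_topic(self) -> str | None:
        """Return the first topic whose practice quota is not yet met."""
        for topic, quota in self.required.items():
            if self.completed.get(topic, 0) < quota:
                return topic
        return None

    def record(self, topic: str) -> None:
        self.completed[topic] = self.completed.get(topic, 0) + 1
```

In this arrangement the curriculum model decides what to practice, the domain model generates and advances the case, the representation function decides what the learner is shown, and the scoring function compares the learner's action with an ideal one.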

To illustrate how these components of the machine would in- 
teract to generate a comprehensive practice experience, we could 
follow the machine through one conceptual cycle of operation. 
This example is a bit simplistic, but illustrative of the concepts. 

Ms. Smith, a medical student, is taking a rotation in clinical 
oncology and indicates to the machine that she wants to practice 
on a simulated case. The process begins with the curriculum model 
determining that she has not completed her minimum quota of 
practice on managing metastatic breast cancer. The machine may 
ask Ms. Smith at that point if she would like some further practice 
in breast cancer management. After an affirmative response, the 
domain model then generates a case of metastatic breast cancer, 
represented mathematically, subject to the constraints passed to it 
by the curriculum model. Because the domain model is inherently 
probabilistic, many features of the case presented to Ms. Smith are
determined by chance and no two cases would be exactly the same. 
The clinical representation engine then converts the initial state 
of the patient to a set of clinical findings that can be made known 
to Ms. Smith, should she request them as part of her initial work- 
up of the patient. 

Ms. Smith’s work then begins. She is told that the patient is in 
her “clinic” and takes a history, performs an exam and runs tests 
on the patient. Only those patient findings actually requested by 
Ms. Smith would be revealed to her. This is mediated through the 
sensory pathways component of the machine. Ms. Smith would
hear the patient’s voice responding to questions, see (and, depend- 
ing on the maturity of the virtual reality component of the ma- 
chine, perhaps feel) the areas affected by the patient’s previous 
surgery, and see the results of lab tests and imaging studies indi- 
cating metastatic disease. Based on Ms. Smith’s initial work-up, she 
then puts the patient on a regimen of chemotherapy. 

The domain model then computes the effects of the chemo- 
therapy on the course of the patient’s disease, mathematically mod- 
eling the growth of tumor cells, the reactions of these to the ther- 
apy, and any toxicity that may result from the therapy. The scoring 
model, in the meantime, has assigned and recorded a score (or 
scores) to the actions Ms. Smith has taken. 
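
To give a flavor of what "mathematically modeling the growth of tumor cells" might mean, here is a deliberately simple sketch. It assumes exponential growth between doses and a fixed fractional cell kill per dose; these assumptions, the function name, and all parameter values are hypothetical illustrations and are not a description of any actual domain model.

```python
import math
from typing import Iterable, List


def simulate_tumor(
    initial_cells: float,
    growth_rate_per_day: float,
    kill_fraction: float,
    dose_days: Iterable[int],
    total_days: int,
) -> List[float]:
    """Return the tumor cell count at day 0 and after each simulated day.

    Assumed dynamics (illustrative only): cells grow exponentially each day,
    and each dose removes a fixed fraction of the cells present that day.
    """
    doses = set(dose_days)
    cells = float(initial_cells)
    history = [cells]
    for day in range(1, total_days + 1):
        cells *= math.exp(growth_rate_per_day)  # one day of growth
        if day in doses:
            cells *= 1.0 - kill_fraction  # fractional kill from a dose
        history.append(cells)
    return history
```

A fuller model would also track toxicity and patient-to-patient variation; this sketch only shows how a therapy decision can feed a state update that the rest of the machine then renders as clinical findings.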

Assuming that the domain model determines that Ms. Smith’s 
therapeutic regimen would cause toxicity, Ms. Smith would encounter
the patient again when that toxicity had developed to the
point that the patient would be symptomatic and would return to 
the clinic. At this later point in simulated time, the domain model 
will have generated a new set of mathematical parameters describ- 
ing the patient’s updated condition, and will have passed them to 
the clinical representation engine. The cycle of the machine’s op- 
eration continues with Ms. Smith having the opportunity to ex- 
amine the patient again, run more tests, and make decisions to 
manage the toxicity. Those decisions would be assigned a score, 
and the simulation would continue until the exercise was completed.
Ms. Smith might indicate to the machine that she was
finished, or the patient might die or become disease free after a 
sufficient period for the domain model to conclude a probable cure. 
It would then be possible for Ms. Smith to initiate a dialog with 
the scoring model, which would present her score and critique her 
performance. If Ms. Smith wished, she could run the simulation 
clock back to a point where her performance was sub-optimal, and 
play a “what if” game by trying something different and experiencing
the consequences of her revised actions.
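
Running the simulation clock back, as in the example above, could be supported by keeping a snapshot of the patient state at every step. The sketch below is one hypothetical way to do this; the class name, its methods, and the dictionary representation of patient state are assumptions made purely for illustration.

```python
import copy
from typing import Any, List


class SimulationClock:
    """Keeps a snapshot of the simulated patient state at every step."""

    def __init__(self, initial_state: Any) -> None:
        self._history: List[Any] = [copy.deepcopy(initial_state)]

    @property
    def now(self) -> int:
        """Index of the current step (0 is the initial state)."""
        return len(self._history) - 1

    def current_state(self) -> Any:
        """Return a copy of the latest state so callers cannot alter history."""
        return copy.deepcopy(self._history[-1])

    def advance(self, new_state: Any) -> None:
        """Record the state produced by the learner's latest decision."""
        self._history.append(copy.deepcopy(new_state))

    def rewind_to(self, step: int) -> Any:
        """Discard everything after `step` and return the state at that step."""
        if not 0 <= step <= self.now:
            raise ValueError("step is outside the recorded history")
        del self._history[step + 1:]
        return self.current_state()
```

After a rewind, the learner's revised decisions are simply recorded with `advance` again, so the original and the "what if" branches never mix.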

How the Machine Will Be Built 

Is the example above science fiction? Partially, but on balance, not. 
Indeed, a primitive version of the simulator described above (see
Figure 3) has been developed through the OncoTCap project at
the University of Pittsburgh [Day et al. 1998]. A key innovative
element of OncoTCap is the development for many specific areas
of oncology of a domain model that is powerful enough to drive a 
simulation of the type described. Indeed, OncoTCap’s domain 
model allows students to play all the “what if” games described 
earlier. They can try again, just to see if they can do better; they 
can do something wrong on purpose, just to see what happens or 
to practice managing the consequences; they can instruct the domain
model to run two types of treatment, each with 100 simulated
patients, to run a “clinical trial” to see which method is superior; 
and they can even change the parameters of the domain model to 
see what the world would be like if biology worked a bit differently 
than science currently thinks it does.
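
The simulated “clinical trial” mentioned above can be sketched in a few lines once one accepts that the domain model is probabilistic. In this sketch each treatment is reduced to a single assumed probability of response per patient; that reduction, the function names, and the default values are hypothetical and are not taken from OncoTCap.

```python
import random
from typing import Tuple


def run_arm(response_probability: float, n_patients: int, rng: random.Random) -> int:
    """Count simulated patients who respond, each decided by an independent draw."""
    return sum(rng.random() < response_probability for _ in range(n_patients))


def simulated_trial(
    prob_a: float,
    prob_b: float,
    n_patients: int = 100,
    seed: int = 0,
) -> Tuple[float, float]:
    """Run two treatment arms and return the observed response rate of each.

    prob_a and prob_b are assumed per-patient response probabilities; a real
    domain model would derive outcomes from its disease and therapy equations.
    """
    rng = random.Random(seed)  # seeded so a student can repeat the same trial
    rate_a = run_arm(prob_a, n_patients, rng) / n_patients
    rate_b = run_arm(prob_b, n_patients, rng) / n_patients
    return rate_a, rate_b
```

Changing the seed gives a different cohort of simulated patients, which lets a student see how much two runs of the same trial can differ by chance alone.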

Other notable efforts to build elements of the marvelous machine 
are described below. Still, it is safe to say that building the marvelous
medical education machine is rocket science. It is much
harder than building a flight simulator, in part because of a major 
difference between aviation and medicine. As Dawson and Kauf- 
man (1998) have observed, in medicine one must manipulate the 
environment whereas in aviation the goal is to avoid it. The prob- 
lems of realistically representing clinical findings, and their 
evolution over time in the same patient, are enormous. When one 
loads, on top of that, the virtual reality aspects of creating the 
sensation of actually interacting with the patient through all senses, 
the full magnitude of the challenges that lie ahead begins to come 
clear. 

Figure 3. Screen shot of the OncoTCap simulator, which has been linked to a clinical presentation engine.

So how will the machine be built? First of all, it will be built
incrementally. Pieces of it already exist; other pieces will arrive in
the near future; and in some sense it will never be complete. It will
just get better and better over time. Second, it will be built
domain-by-domain. The comprehensive unified mathematical model of
human biology, the “Maxwell’s Equations of biology,” probably does not
exist and, if it does, is not likely to be discovered anytime
soon. What we are therefore likely to see in the near future are cancer
simulators, anesthesiology simulators, diabetes simulators, surgical 
simulators, etc. These domain-specific simulations will become in- 
creasingly sophisticated, and then at some point in the future, the 
models will become sufficiently powerful that simulations from dif- 
ferent domains will begin to merge. Finally, the marvelous machine 
will be developed through collaborations among clinical domain 
experts and scientists in various disciplines. The clinical represen- 
tation and sensory pathway components are problems that fall to 
computer scientists and engineers; domain modeling is work for 
computational biologists and researchers in artificial intelligence; 
the scoring and curriculum models are the purview of psychometricians
and decision analysts.

Collaborative efforts to develop components of the marvelous 
machine abound. Many collaborations, some of which have created 
mature products, have been ongoing for many years. To cite just a 
few examples, two models of simulators for anesthesiology have 
reached a high level of development [Norman and Wilkins 1996]. 
A group in London has developed a prototype marvelous machine
for diabetes [Lehmann 1998]. Groups at Stanford and UC-San
Diego have taken important strides in developing anatomical sim- 
ulations that are the basis for building practice on clinical proce- 
dures into the marvelous machine [Hoffman and Vu 1997; Dev et 
al. 1998], as have groups at Mayo and Walter Reed Hospital in the
specific area of GI procedures and endoscopy [Robb 1997]. It is 
important to acknowledge the CBX (Computer Based Exam) proj- 
ect of the National Board of Medical Examiners, which has created 
a comprehensive simulation environment of the U.S. medical certification
process [Clauser et al. 1998] and an effort underway at
the American Board of Family Practice to develop simulations 
driven by mathematical models [Sumner et al. 1998]. Algorithms 
that would drive a scoring module of the type described above have 
also been developed [Downs et al. 1997].

So while the marvelous machine as a whole does not exist, it is 
very safe to say that significant bits and pieces of it do exist and
there are substantial reasons to believe that it can and will be built 
over time. 

Conclusion: The Educational Research Challenges 

A final piece of the challenge of the marvelous machine is the set
of educational research questions that must be addressed if the machine
is going to be built properly and its value and place in medical
education thoroughly understood. By this I do not mean the
myriad technical research challenges that will have to be overcome
to build the domain models, scoring models, clinical representation
engines, and sensory pathways to the learner. As discussed
earlier, these fall properly into the research areas of computer
science and engineering, computational biology, and other fields.

From an educational research perspective, the key questions map 
out uncharted territory because of the novelty of what the machine
can do. That the machine represents new technology
with the potential to be of benefit does not mean that the
machine will be of benefit. As with any technology, there is potential
for it to leave us less well off than we were before. In the end,
no matter how well the technology itself functions, the success of 
the machine will depend on how the machine is used, the educa- 
tional engineering of the machine into a comprehensive learning 
environment in which the machine is but one element. As noted 
earlier, teaching around live patients is not going to go away, no 
matter how sophisticated the machine becomes over time. Even 
though the live lecture as an educational medium is completely 
stuck in space and time and content, the live lecture will likely 
prove more durable than its most strident critics would have us 
believe. The proper use and integration of the machine into med- 
ical education can be directed profoundly by research that addresses 
questions such as: 

■ Relative to the domain model, how “good” do these models have 
to be in order for them to be ready for use in education? For
educational purposes it is perhaps sufficient for the domain model
to create and evolve cases that are plausible, but not absolutely
correct [Friedman 1995]. But how plausible is plausible enough? 

■ Relative to the curriculum model, how should an unstuck cur- 
riculum be structured? With freedom to learn anywhere, anytime, 
and on topics of student and/or faculty choice, how much free- 
dom is the right amount of freedom? What should be con- 
strained? To what extent should the domain learning model be 
discovery-oriented? How should more, or perhaps less,
freedom be granted to learners as their experience and expertise
accumulate over time? 

■ Relative to the scoring model, all of the reproducibility and va- 
lidity issues that arise with any new assessment technique arise 
with the marvelous machine as well. The score a student receives 
for working one simulated case (or a battery of cases comprising
a certification examination) has to be meaningful. Other questions
about the scoring model concern the structure of feedback.
What models for presenting feedback to learners, during
and after a case, are most facilitative of learning?

This surface glance at the important educational research questions
that attach to the marvelous machine brings this paper to its
closing plea. Perhaps this is a plea that is totally unnecessary, but
the potential role of the marvelous machine in medical education 
seems so important that the research community should address 
itself to it sooner rather than later. Much of the needed educational 
research can be applied formatively to guide ongoing developmen- 
tal efforts, and it is not too early to get started. The biggest mistake 
at this point would be to view the machine parochially as a technical
undertaking, leaving its development solely to the “techies”
until some point in the future, by which time many key opportunities
may be lost.

So in some sense, the educational research community faces, on
a smaller scale, the same challenge posed by the marvelous machine 
to the medical education community as a whole. The machine is 
coming; it is inevitable. It will gradually and by dint of great cre- 
ative effort unstick medical education in space, time, and content. 
Those who ignore it run the risk of becoming irrelevant; those who
embrace it can do enormous good for the profession and, ultimately,
for the health of the public we all serve.